
Sentinel

AI-powered code review assistant for GitHub pull requests. Differentiated by a reproducible evaluation harness — 100 hand-labeled PRs with precision/recall/F1 per comment category, regression-gated in CI.

The problem

Every AI portfolio project in 2026 wraps an LLM with LangChain and calls it done. What's missing — and what production AI teams actually care about — is evaluation. How do you know your system works? How do you catch regressions when you change a prompt? Sentinel exists to answer those questions with numbers, not vibes.
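What "regression-gated in CI" means concretely: every eval run is compared against a baseline metrics file committed to the repo, and the build fails if any category's F1 drops. A minimal sketch, assuming per-category F1 scores in a JSON baseline; the function name and file layout are illustrative, not Sentinel's actual harness:

```python
import json
import sys


def check_regression(baseline_path, current, tolerance=0.02):
    """Fail CI when any category's F1 drops more than `tolerance` below baseline.

    baseline_path: JSON file of {category: f1} committed to the repo.
    current: {category: f1} from the eval run on this branch.
    Returns a process exit code (0 = pass, 1 = regression detected).
    """
    with open(baseline_path) as f:
        baseline = json.load(f)

    failures = [
        f"{cat}: F1 {current.get(cat, 0.0):.3f} < baseline {f1:.3f}"
        for cat, f1 in baseline.items()
        if current.get(cat, 0.0) < f1 - tolerance
    ]
    for line in failures:
        print("REGRESSION", line, file=sys.stderr)
    return 1 if failures else 0
```

A prompt change that quietly tanks security recall then shows up as a red build instead of a surprise in production.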

Architecture

GitHub Webhook → FastAPI → Diff Parser → Hybrid Retriever → LLM Gateway → Structured Output → GitHub Check Runs

Structured output also feeds the Eval Harness (run in CI) and the Dashboard (Next.js).

Key decisions

Decision | Choice | Why
Retrieval | Hybrid BM25 + dense | Pure vector search misses exact identifiers; hybrid retrieval is the state of the art for code search.
LLM layer | Custom gateway, not LangChain | A ~200-line gateway with retries, cost tracking, and fallback is far easier to debug than a framework dependency.
Eval dataset | Hand-labeled, not LLM-labeled | Avoids the LLM-grading-LLM echo chamber. 100 PRs from 5 OSS repos, labeled by hand.
Output format | Pydantic + JSON mode | Type-safe structured output enables automated eval scoring.
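One standard way to combine BM25 and dense rankings is reciprocal rank fusion, which needs no score normalization across the two retrievers. A minimal sketch; Sentinel's exact fusion scheme isn't specified here, so treat this as illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    rankings: e.g. [bm25_ranking, dense_ranking], each a list of doc ids,
    best first. Each list contributes 1 / (k + rank) to a doc's score,
    so a doc ranked well by both retrievers rises to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

An exact-identifier hit from BM25 that the dense index misses still surfaces, which is the failure mode the hybrid choice guards against.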

Results

Security F1: 0.62 across 100 labeled PRs
Avg latency: 8.3 s per review
Daily cost: $1.40 at 10 PRs/day
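The per-category scores come from comparing Sentinel's comments against the hand labels. A minimal sketch of that scoring, assuming each comment is reduced to a (pr_id, location, category) tuple and a prediction counts only on an exact match; the function name is illustrative, not Sentinel's actual API:

```python
def per_category_scores(labeled, predicted):
    """Precision/recall/F1 per comment category.

    labeled, predicted: iterables of (pr_id, location, category) tuples.
    A predicted comment is a true positive only if the exact tuple
    appears in the hand labels.
    """
    labeled, predicted = set(labeled), set(predicted)
    categories = {cat for *_, cat in labeled | predicted}
    scores = {}
    for cat in categories:
        gold = {x for x in labeled if x[-1] == cat}
        pred = {x for x in predicted if x[-1] == cat}
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cat] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```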