Everyone is building AI-powered developer tools in 2026. Most of them work like this: take a diff, pipe it into GPT-4, return the result. Ship it. Call it "AI-powered code review."
I built Sentinel — a GitHub App that reviews pull requests — and spent more time on the evaluation harness than on the reviewer itself. Here's why, and what I learned.
The eval problem nobody talks about
When I first built the review pipeline, the output looked great. Claude generated thoughtful, specific comments about security issues, potential bugs, and style violations. I showed it to three engineers. They were impressed.
Then I ran it on 100 PRs from real open-source repos and compared the output to what senior engineers actually flagged. The security category had a precision of 0.41. Nearly 60% of its "security findings" were false positives.
Without the eval harness, I would have shipped it thinking it was great. With it, I knew exactly what to fix.
How the eval harness works
The core idea is simple: treat code review quality like a classification problem. Each review comment has a category (security, bug, performance, style) and a severity. Each labeled PR has ground-truth comments that a senior engineer would actually flag.
# Matching logic (simplified)
def score(generated_comments, labels):
    tp = fp = 0
    for comment in generated_comments:
        matched = any(
            label["file"] == comment["file"]               # same file (exact)
            and abs(label["line"] - comment["line"]) <= 5  # same line (± 5 lines)
            and label["category"] == comment["category"]   # same category (exact)
            for label in labels
        )
        if matched:
            tp += 1
        else:
            fp += 1
    return tp, fp

The harness computes precision, recall, and F1 per category. It runs in CI on every push. If any category's F1 drops more than 5% from its baseline, the build fails.
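The metric computation and the CI gate fit in a few lines. Here is a minimal sketch; the function names are illustrative, and treating "5%" as a relative drop from baseline is my assumption, not necessarily how Sentinel interprets it:

```python
def per_category_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts for one category."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def check_regression(current_f1, baseline_f1, max_drop=0.05):
    """Return the categories whose F1 fell more than max_drop (relative) below baseline."""
    failures = []
    for category, base in baseline_f1.items():
        cur = current_f1.get(category, 0.0)
        if base > 0 and (base - cur) / base > max_drop:
            failures.append(category)
    return failures
```

In CI, a non-empty return from `check_regression` exits non-zero and fails the build, so a regression in any single category blocks the merge even if the overall average looks fine.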
What I learned
Three things surprised me. First, hybrid retrieval (BM25 + vector) dramatically outperformed pure vector search for code — exact identifiers like function names matter more than semantic similarity. Second, the model's stated confidence correlates weakly with actual correctness — calibration is a real problem. Third, honestly reporting mediocre metrics is more impressive in interviews than claiming perfection.