Everyone is building AI-powered developer tools in 2026. Most of them work like this: take a diff, pipe it into GPT-4, return the result. Ship it. Call it "AI-powered code review."
I built Sentinel — a GitHub App that reviews pull requests — and spent more time on the evaluation harness than on the reviewer itself. Here's why, and what I learned.
The eval problem nobody talks about
When I first built the review pipeline, the output looked great. Claude generated thoughtful, specific comments about security issues, potential bugs, and style violations. I showed it to three engineers. They were impressed.
Then I ran it on 100 PRs from real open-source repos and compared the output to what senior engineers actually flagged. The security category had a precision of 0.41. Nearly 60% of its "security findings" were false positives.
Without the eval harness, I would have shipped it thinking it was great. With it, I knew exactly what to fix.
How the eval harness works
The core idea is simple: treat code review quality like a classification problem. Each review comment has a category (security, bug, performance, style) and a severity. Each labeled PR has ground-truth comments that a senior engineer would actually flag.
# Matching logic (simplified)
def score(generated_comments, labels):
    tp = fp = 0
    for comment in generated_comments:
        matched = any(
            label["file"] == comment["file"]               # same file (exact)
            and abs(label["line"] - comment["line"]) <= 5  # same line (± 5 lines)
            and label["category"] == comment["category"]   # same category (exact)
            for label in labels
        )
        if matched:
            tp += 1
        else:
            fp += 1
    return tp, fp

The harness computes precision, recall, and F1 per category. It runs in CI on every push. If any category's F1 drops more than 5% from its baseline, the build fails.
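The metric computation and the CI gate fit in a few lines. Here is a minimal sketch; the function names are illustrative, and treating "5%" as a relative drop from baseline is my assumption, not necessarily how Sentinel interprets it:

```python
def per_category_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts for one category."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def check_regression(current_f1, baseline_f1, max_drop=0.05):
    """Return the categories whose F1 fell more than max_drop (relative) below baseline."""
    failures = []
    for category, base in baseline_f1.items():
        cur = current_f1.get(category, 0.0)
        if base > 0 and (base - cur) / base > max_drop:
            failures.append(category)
    return failures
```

In CI, a non-empty return from `check_regression` exits non-zero and fails the build, so a regression in any single category blocks the merge even if the overall average looks fine.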
What I learned
Three things surprised me. First, hybrid retrieval (BM25 + vector) dramatically outperformed pure vector search for code — exact identifiers like function names matter more than semantic similarity. Second, the model's stated confidence correlates weakly with actual correctness — calibration is a real problem. Third, honestly reporting mediocre metrics is more impressive in interviews than claiming perfection.