LLM-as-Judge: Using AI to Grade AI
Hamel Husain's critique shadowing approach shows how to build an LLM judge that agrees with your domain expert 90%+ of the time. Here is the practical guide for AI product teams.
TL;DR
- LLM-as-judge lets you scale evaluation from dozens of human-reviewed interactions to thousands of automated assessments, but only when the judge is calibrated against domain expert judgment first
- Hamel Husain’s critique shadowing approach builds agreement between LLM and human graders iteratively, starting from a shared rubric and converging through disagreement analysis
- The judge becomes truly powerful when it understands the user via self-models, because the same response can be excellent for one user and terrible for another
LLM-as-judge is a method for scaling AI evaluation from dozens of human-reviewed interactions to thousands of automated assessments by training a language model to grade like a domain expert. An uncalibrated LLM judge is worthless, but Hamel Husain’s critique shadowing approach [1] shows how to reach 90%+ agreement between the judge and human experts through systematic calibration. This post covers the three-round calibration process, how to build a reliable judge prompt, and why self-models unlock the fit dimension that standard judges miss.
The Critique Shadowing Process
Hamel Husain’s approach to building reliable LLM judges follows what I call critique shadowing: the LLM judge shadows the human expert, and you systematically close the gap between their assessments.
Round 1: Establish the Baseline. Have your domain expert grade 50 interactions using your rubric. Score each dimension (relevance, coherence, depth, fit) on a 1-5 scale. Then have the LLM judge grade the same 50 interactions using the same rubric. Compare the scores.
On round 1, expect 60-70% agreement. The LLM will be too generous on some dimensions, too harsh on others, and completely miss the point on a few. This is normal.
Round 2: Analyze Disagreements. For every interaction where the LLM and expert disagree, document why. Is the LLM ignoring context? Is it applying the wrong standard? Is it failing to distinguish between technically correct and actually helpful?
Use these disagreements to refine the judge’s prompt. Add specific examples of what a 3 versus a 4 looks like for each dimension. Include the expert’s reasoning for their scores.
Round 3: Reconverge. Run the refined judge on a new set of 50 expert-graded interactions. Agreement should be 85-90%. For the remaining disagreements, make a judgment call: is the LLM wrong, or is the expert’s criteria ambiguous? Sometimes the calibration process reveals that the rubric needs clarification, not the judge.
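To make the round-by-round comparison concrete, here is a minimal TypeScript sketch of the bookkeeping: it computes per-dimension agreement between expert and judge scores and collects the disagreements you analyze in Round 2. The types and the exact-match agreement rule are assumptions; adapt them to however strictly your rubric defines "agreement".

```typescript
// Minimal sketch: per-dimension expert/judge agreement plus the disagreement
// list for Round 2 analysis. Types and the exact-match rule are assumptions.
type Dimension = 'relevance' | 'coherence' | 'depth' | 'fit';
type Scores = Record<Dimension, number>; // 1-5 per the rubric

interface GradedInteraction {
  id: string;
  expert: Scores;
  judge: Scores;
}

const DIMENSIONS: Dimension[] = ['relevance', 'coherence', 'depth', 'fit'];

function agreementReport(graded: GradedInteraction[]) {
  const perDimension: Record<Dimension, number> = {
    relevance: 0,
    coherence: 0,
    depth: 0,
    fit: 0,
  };
  const disagreements: Array<{
    id: string;
    dimension: Dimension;
    expert: number;
    judge: number;
  }> = [];

  for (const item of graded) {
    for (const dim of DIMENSIONS) {
      if (item.expert[dim] === item.judge[dim]) {
        perDimension[dim] += 1; // count exact matches per dimension
      } else {
        // Every disagreement becomes a Round 2 case study
        disagreements.push({
          id: item.id,
          dimension: dim,
          expert: item.expert[dim],
          judge: item.judge[dim],
        });
      }
    }
  }

  for (const dim of DIMENSIONS) {
    perDimension[dim] /= graded.length; // agreement rate, 0-1
  }
  return { perDimension, disagreements };
}
```

Expect the Round 1 numbers to land in the 60-70% range described above; the disagreement list is the raw material for refining the judge prompt.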
Round 1: Establish Baseline
Expert grades 50 interactions. LLM judge grades the same 50. Compare scores. Expect 60-70% agreement on first pass.
Round 2: Analyze Disagreements
Document why the LLM and expert disagree. Refine the judge prompt with specific examples and expert reasoning for each score level.
Round 3: Reconverge
Run refined judge on a new set of 50 expert-graded interactions. Target 85-90% agreement. Remaining gaps may indicate rubric ambiguity.
Uncalibrated LLM Judge
- × Confidently assigns scores with no human anchor
- × 60-70% agreement with domain experts
- × Misses context-dependent quality differences
- × Cannot distinguish helpful from technically correct
- × Automated vibes with better formatting
Calibrated LLM Judge
- ✓ Trained against 50+ expert-graded examples
- ✓ 90%+ agreement with domain experts
- ✓ Includes expert reasoning in its evaluations
- ✓ Catches nuanced quality failures at scale
- ✓ Scales expert judgment to every interaction
Building the Judge Prompt
The judge prompt is the most important artifact in your eval system. Here is the structure that consistently produces 90%+ agreement.
Section 1: Role and Context. Tell the LLM it is evaluating AI product outputs against a specific rubric. Provide product context: what the AI does, who uses it, what a successful interaction looks like.
Section 2: The Rubric. Include the full rubric with definitions for each score level on each dimension. Not just what a 5 looks like: what a 1, 2, 3, 4, and 5 look like. The more specific, the higher the agreement.
Section 3: Calibration Examples. Include 5-10 expert-graded examples with the expert’s reasoning. These anchor the judge’s understanding. The examples should span the score range: a clear 5, a clear 1, and several in the ambiguous middle where context matters.
Section 4: Output Format. Require structured output with a score for each dimension plus a reasoning explanation. The reasoning is essential: it lets you audit the judge’s logic and catch systematic errors.
Section 1: Role & Context
Tell the LLM it is evaluating AI product outputs. Provide product context: what the AI does, who uses it, what success looks like.
Section 2: The Rubric
Full rubric with definitions for every score level (1-5) on each dimension. The more specific the definitions, the higher the agreement.
Section 3: Calibration Examples
5-10 expert-graded examples with reasoning. Span the full score range: a clear 5, a clear 1, and several in the ambiguous middle.
Section 4: Output Format
Structured output with per-dimension scores plus reasoning explanation. The reasoning lets you audit the judge and catch systematic errors.
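As a rough illustration of how the four sections fit together, here is a sketch of a prompt builder. The `CalibrationExample` shape, the JSON output schema, and the section wording are placeholders; the structure (role and context, then rubric, then calibration examples, then output format) is the part to keep.

```typescript
// Sketch of assembling the four sections into one judge prompt.
// Field names and wording here are illustrative placeholders.
interface CalibrationExample {
  interaction: string;
  expertScores: Record<string, number>; // dimension -> 1-5
  expertReasoning: string;
}

function buildJudgePrompt(opts: {
  productContext: string; // Section 1: what the AI does, who uses it, what success looks like
  rubric: string; // Section 2: definitions for every score level on every dimension
  examples: CalibrationExample[]; // Section 3: expert-graded anchors spanning the score range
}): string {
  const exampleBlock = opts.examples
    .map(
      (ex, i) =>
        `Example ${i + 1}\n` +
        `Interaction: ${ex.interaction}\n` +
        `Expert scores: ${JSON.stringify(ex.expertScores)}\n` +
        `Expert reasoning: ${ex.expertReasoning}`
    )
    .join('\n\n');

  return [
    // Section 1: role and context
    `You are evaluating outputs from an AI product against a rubric.\n${opts.productContext}`,
    // Section 2: the rubric, with a definition for every score level
    `Rubric:\n${opts.rubric}`,
    // Section 3: calibration examples with the expert's reasoning
    `Expert-graded examples:\n${exampleBlock}`,
    // Section 4: structured output so the judge's logic can be audited
    `Return JSON only: {"relevance": 1-5, "coherence": 1-5, "depth": 1-5, "fit": 1-5, "reasoning": "<why you scored each dimension>"}`,
  ].join('\n\n---\n\n');
}
```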
```typescript
// Build a calibrated LLM judge (trained against expert judgment)
const judge = await clarity.createJudge({
  rubric: qualityRubric,
  calibrationExamples: expertGradedSamples,
  dimensions: ['relevance', 'coherence', 'depth', 'fit'],
});

// Grade an interaction with user context (fit scoring needs the self-model)
const evaluation = await judge.evaluate({
  interaction: aiResponse,
  userContext: await clarity.getSelfModel(userId),
  explainReasoning: true,
});

// Returns auditable reasoning:
// { relevance: 4, coherence: 5, depth: 3, fit: 4,
//   reasoning: 'Fit scored 4: adapted technical depth for senior engineer...' }
```
The Self-Model Advantage for LLM Judges
Here is where most LLM-as-judge implementations hit a wall they do not see coming.
A standard LLM judge evaluates whether the output is generally good. But a recommendation that is perfect for a beginner is condescending to an expert. An explanation that delights a technical user confuses a business stakeholder. The same output deserves different scores depending on who received it.
Without user context, the judge cannot evaluate fit. And fit, as Hamel Husain acknowledges [2], is the dimension that most directly correlates with user satisfaction. You can have perfect relevance, coherence, and depth, and still lose the user because the output was not tailored to them.
Self-models solve this by providing the judge with the user’s context: their expertise level, their goals, their communication preferences, their domain knowledge. The judge can then evaluate whether the output matched the person, not just whether it matched the question.
This transforms the judge from a generic quality checker into a personalization quality checker: the eval that matters most for retention.
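Here is a small sketch of what that looks like in practice: the same response is packaged for the judge twice, once per recipient, with self-model fields attached. The `SelfModel` fields and the helper below are assumptions for illustration, not a fixed schema.

```typescript
// Illustrative only: the self-model fields and helper are assumptions.
interface SelfModel {
  expertiseLevel: 'beginner' | 'intermediate' | 'expert';
  goals: string[];
  communicationPreferences: string;
}

// Package a response plus the recipient's context for fit evaluation.
function buildFitInput(response: string, user: SelfModel): string {
  return [
    `Response to evaluate:\n${response}`,
    `Recipient context:`,
    `- Expertise: ${user.expertiseLevel}`,
    `- Goals: ${user.goals.join(', ')}`,
    `- Communication preferences: ${user.communicationPreferences}`,
    `Score "fit" on a 1-5 scale: did this response match this specific person?`,
  ].join('\n');
}

// The same deep technical walkthrough, judged for two different recipients,
// should not receive the same fit score.
const deepDive = 'Detailed walkthrough of the retry queue internals...';

const forNewcomer = buildFitInput(deepDive, {
  expertiseLevel: 'beginner',
  goals: ['understand what the feature does'],
  communicationPreferences: 'plain language, no jargon',
});

const forEngineer = buildFitInput(deepDive, {
  expertiseLevel: 'expert',
  goals: ['debug a failing retry path'],
  communicationPreferences: 'precise and technical',
});
```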
Common Judge Failures and Fixes
Positivity bias. LLM judges tend to grade generously. Fix: include examples of clearly poor outputs that received low expert scores. Anchor the low end of the scale explicitly.
Length bias. Longer responses tend to get higher scores, even when brevity is more appropriate. Fix: include calibration examples where the expert preferred the concise response and explain why.
Style over substance. The judge rewards well-formatted, articulate responses even when they miss the point. Fix: include examples where a technically correct but unhelpful response received a low score, with explicit reasoning about the difference between sounding good and being good.
Context blindness. The judge evaluates the response in isolation, ignoring the conversation history or user context. Fix: always include the full interaction context (previous messages, user profile, session intent) in the judge's input.
Positivity Bias
Everything scores 4-5. Fix: include examples of clearly poor outputs that received low expert scores. Anchor the low end explicitly.
Length Bias
Verbose responses outscore concise ones. Fix: include calibration cases where the expert preferred brevity and explain why.
Style Over Substance
Pretty but wrong scores high. Fix: add technically-correct-but-unhelpful examples with explicit reasoning about the difference.
Context Blindness
Same score regardless of user. Fix: include user context via self-models so the judge can evaluate fit, not just generic quality.
| Judge Failure | Symptom | Fix |
|---|---|---|
| Positivity bias | Everything scores 4-5 | Anchor low scores with expert examples |
| Length bias | Verbose responses outscore concise ones | Include brevity-preferred calibration cases |
| Style over substance | Pretty but wrong scores high | Add technically-correct-but-unhelpful examples |
| Context blindness | Same score regardless of user | Include user context via self-models |
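One practical way to apply these fixes is to seed the calibration set with examples chosen specifically to counter each bias. The interactions and scores below are invented placeholders; the point is that the set should anchor the low end of the scale, reward brevity where brevity was right, and punish polished-but-unhelpful answers.

```typescript
// Invented placeholder examples; each entry targets one of the biases above.
const biasCounterExamples = [
  {
    // Positivity bias: anchor the low end of the scale explicitly
    interaction: 'User asked how to cancel their plan; the AI described unrelated features at length.',
    expertScores: { relevance: 1, coherence: 4, depth: 2, fit: 1 },
    expertReasoning: 'Fluent, but it never answers the question. Relevance is a 1, not a 3.',
  },
  {
    // Length bias: the concise answer was the better answer
    interaction: 'User asked for the API rate limit; the AI replied with the number and a docs link.',
    expertScores: { relevance: 5, coherence: 5, depth: 4, fit: 5 },
    expertReasoning: 'Brevity was correct here. A longer answer would have scored lower.',
  },
  {
    // Style over substance: well-formatted but misses the point
    interaction: 'A nicely structured migration guide that targets the wrong database version.',
    expertScores: { relevance: 2, coherence: 5, depth: 3, fit: 2 },
    expertReasoning: 'Sounds good, is not good. Articulate, but unusable as written.',
  },
];
```

Context blindness is the one failure that calibration examples alone cannot fix; it requires passing user context into every evaluation, as in the self-model sketch above.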
Trade-offs
LLM judges cost money to run. Evaluating every interaction with a judge model adds inference costs. Budget 10-20% of your production inference cost for evaluation. For most teams, this is worth it, but it is a real line item.
Judge quality degrades over time. As your product evolves, the judge’s calibration examples become stale. Plan for quarterly recalibration with fresh expert-graded examples.
Circular dependency risk. If the same model family judges itself, systematic blind spots pass undetected. Use a different model family for judging than for production. Or better yet, use self-model context to ground the evaluation in user reality rather than model opinion.
Expert disagreement is real. If your domain experts disagree with each other 20% of the time, your judge cannot exceed 80% agreement with any single expert. Measure inter-expert agreement first: it sets the ceiling for judge accuracy.
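A quick sketch of that ceiling check, assuming two experts have graded the same interactions and using the same exact-match agreement rule as the calibration process:

```typescript
// Sketch of the ceiling check: exact-match agreement between two experts
// on the same interactions. The agreement rule is an assumption; use the
// same rule you use for judge calibration.
type DimensionScores = Record<string, number>; // dimension -> 1-5

function interExpertAgreement(
  expertA: DimensionScores[],
  expertB: DimensionScores[]
): number {
  let agreed = 0;
  let total = 0;
  for (let i = 0; i < expertA.length; i++) {
    for (const dim of Object.keys(expertA[i])) {
      total += 1;
      if (expertA[i][dim] === expertB[i][dim]) agreed += 1;
    }
  }
  // Whatever this rate is, it is roughly the ceiling for judge accuracy.
  return agreed / total;
}
```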
What to Do Next
1. Start with 50 expert-graded examples. Have your best domain expert grade 50 real interactions using a defined rubric. This dataset is the foundation of everything: your calibration set, your baseline agreement measurement, and your first audit trail of what good looks like.
2. Build and calibrate in three rounds. Use Hamel Husain's eval framework [3] as your guide. Round 1: baseline agreement. Round 2: disagreement analysis and prompt refinement. Round 3: convergence check. Target 90%+ agreement before deploying the judge to production.
3. Add user context to the judge. Once your judge achieves 90% agreement on generic quality, extend it with self-model context for fit evaluation. This is the highest-leverage improvement you can make, because fit is the dimension that predicts retention. Clarity provides the self-model API that makes context-aware judging possible.
An uncalibrated LLM judge is just automated vibes. A calibrated one scales your best expert to every interaction. Build the judge.
References
1. Hamel Husain's critique shadowing approach
2. Hamel Husain acknowledges
3. Hamel Husain's eval framework
4. 2016 survey of 2,000 Americans by Reelgood and Learndipity Data Insights
5. Product vs. Feature Teams
6. Only 1 in 26 unhappy customers actually complains
7. Cold start problem
8. Qualtrics churn prediction framework