
The Grading Rubric Your AI Product Is Missing

Your English teacher had a rubric for every essay. Your AI product has vibes. How to apply Hamel Husain's 3-level eval framework to build the quality criteria your team never wrote down.

Robert Ta, CEO & Co-Founder · 7 min read

TL;DR

  • Most AI product teams have never written down what a good AI response looks like; they ship on vibes and discover quality problems only when users leave
  • Hamel Husain’s 3-level eval framework gives you the scaffolding: unit-level assertions, human review protocols, and system-level alignment measurement
  • A grading rubric turns AI quality from a feeling into a conversation your entire team can have, from the engineer debugging a prompt to the CEO asking about retention

The grading rubric most AI products are missing is a written definition of what good, acceptable, and poor look like across four dimensions: relevance, coherence, depth, and fit. Hamel Husain’s 3-level eval framework provides the scaffolding to turn that rubric into automated quality measurement at scale. This post covers how to apply the rubric at each eval level, why fit is the dimension that requires self-models, and how to close the gap between vibes-based quality and evidence-based quality.


The Teacher Had It Right

Hamel Husain’s eval framework [1] describes three levels of AI evaluation that every product team should implement. Level 1 is unit-level assertions: the equivalent of spell-checking your essay. Level 2 is human review: the teacher reading your paper. Level 3 is system-level measurement: whether your essays are improving over the semester.

Most AI teams stop at Level 1. They write a few test cases, check that the model does not hallucinate on a known set of inputs, and call it done. That is like checking for spelling errors and declaring the essay complete.

The rubric is what connects all three levels. It defines what you are measuring at each stage. Without it, your Level 1 assertions test arbitrary things, your Level 2 reviewers each grade against their own private criteria, and your Level 3 metrics measure the wrong outcomes.

Mrs. Chen, my English teacher, had it right: her rubric worked because it gave everyone the same language. The student knew what to aim for. The teacher knew what to look for. The parent knew what the grade meant. Your AI product needs the same shared vocabulary.

The Four Dimensions of AI Quality

After working with dozens of AI product teams and studying how Hamel Husain approaches evaluation [2], I have found that AI quality decomposes into four dimensions that translate directly from essay grading:

Relevance is whether the AI addressed what the user actually needed. Not what they typed: what they meant. A search engine that returns technically correct results for the wrong interpretation scores low on relevance. This maps to Hamel’s assertion-level checks: can you write a test that verifies the AI understood the intent?

Coherence is whether the output holds together across interactions. Does the AI contradict itself? Does it maintain context over a conversation? An assistant that recommends aggressive growth in one message and conservative caution in the next scores low on coherence.

Depth is whether the AI went beyond restating what the user already knew. Did it surface an insight? Did it connect dots the user had not connected? Depth is the hardest dimension to automate and the most valuable to users.

Fit is whether the output matched this specific user, their expertise level, their goals, their context. Generic advice to a domain expert scores low on fit. Technical jargon to a non-technical executive scores low on fit.
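
Written down, the rubric is small enough to live in code. Here is a minimal sketch in TypeScript; the weights and thresholds mirror the scoring example later in this post, and the grade-band descriptions are illustrative, not prescriptions:

rubric.ts
// A written rubric as a typed object: every dimension gets a weight, a pass
// threshold, and plain-language descriptions of the grade bands.
// The numbers and wording below are examples, not recommendations.
type Dimension = "relevance" | "coherence" | "depth" | "fit";

interface DimensionSpec {
  weight: number;     // share of the composite score; weights sum to 1
  threshold: number;  // minimum acceptable score, 0..1
  excellent: string;  // what "excellent" looks like
  poor: string;       // what "poor" looks like
}

const rubric: Record<Dimension, DimensionSpec> = {
  relevance: {
    weight: 0.30, threshold: 0.80,
    excellent: "Addresses what the user meant, not just what they typed",
    poor: "Technically correct answer to the wrong question",
  },
  coherence: {
    weight: 0.25, threshold: 0.85,
    excellent: "Consistent positions and preserved context across turns",
    poor: "Contradicts itself within a single conversation",
  },
  depth: {
    weight: 0.20, threshold: 0.70,
    excellent: "Surfaces an insight the user had not connected",
    poor: "Restates what the user already knew",
  },
  fit: {
    weight: 0.25, threshold: 0.75,
    excellent: "Matches this user's expertise, goals, and context",
    poor: "Generic advice to an expert, or jargon to a non-expert",
  },
};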


Quality Without a Rubric

  • PM says the AI feels off
  • Engineer says evals pass
  • CEO asks if quality is improving
  • Nobody can answer with specifics
  • Team debates opinions, not evidence

Quality With a Rubric

  • PM says relevance dropped 8% after the last release
  • Engineer knows exactly which dimension to debug
  • CEO sees a 4-dimension trend line improving quarterly
  • Fit is the weakest dimension, so invest in personalization
  • Team discusses evidence, not feelings

Building the Rubric: Hamel’s Three Levels Applied

Hamel Husain’s framework gives you the structure. The rubric gives you the content. Here is how they fit together.

Level 1: Assertions (The Spell Check). For each dimension, write concrete pass/fail tests. Relevance: given this input, does the output address the stated topic? Coherence: given a multi-turn conversation, does the AI maintain consistent positions? These are your automated checks that run on every deployment.
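
In code, Level 1 checks are ordinary tests. A minimal sketch, assuming a hypothetical generateResponse helper standing in for your model call; the keyword checks are deliberately crude, which is the point of this level:

level1-assertions.ts
import assert from "node:assert";

// Hypothetical helper that calls your model; swap in your own client.
declare function generateResponse(
  prompt: string,
  opts?: { history?: string[] }
): Promise<string>;

// Relevance: given a known input, the output must address the stated topic.
export async function assertRelevance(): Promise<void> {
  const output = await generateResponse("How do I reduce churn in my B2B SaaS?");
  const onTopic = ["churn", "retention"].some((t) => output.toLowerCase().includes(t));
  assert(onTopic, "Output never addresses churn or retention");
}

// Coherence: across a multi-turn conversation, the assistant must not flip positions.
export async function assertCoherence(): Promise<void> {
  const turn1 = await generateResponse("Should we grow aggressively this quarter?");
  const turn2 = await generateResponse("So what pace do you recommend?", { history: [turn1] });
  const flipped = /aggressive/i.test(turn1) && /conservative/i.test(turn2);
  assert(!flipped, "Assistant contradicted itself across turns");
}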

Level 2: Human Review (The Teacher). As Hamel describes, you need domain experts reviewing a sample of outputs against the rubric. The key insight from his work: the human reviewers need to agree with each other first before they can evaluate the AI. If your three reviewers cannot agree on what a good response looks like, you do not have a rubric, you have three rubrics.
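
A quick way to check that agreement before any AI scoring happens: have every reviewer grade the same sample pass/fail against the rubric and compute pairwise agreement. A minimal sketch; the reviewer names and grades are made up for illustration:

reviewer-agreement.ts
type Grades = boolean[]; // pass/fail per interaction, same order for every reviewer

function pairwiseAgreement(a: Grades, b: Grades): number {
  const matches = a.filter((grade, i) => grade === b[i]).length;
  return matches / a.length;
}

// Three reviewers grade the same eight interactions (hypothetical data).
const reviewers: Record<string, Grades> = {
  alice: [true, true, false, true, false, true, true, false],
  bob:   [true, true, false, false, false, true, true, true],
  carol: [true, false, false, true, false, true, true, false],
};

// If any pair agrees on much less than ~80% of items, you do not yet have one rubric.
const names = Object.keys(reviewers);
for (let i = 0; i < names.length; i++) {
  for (let j = i + 1; j < names.length; j++) {
    const agreement = pairwiseAgreement(reviewers[names[i]], reviewers[names[j]]);
    console.log(`${names[i]} vs ${names[j]}: ${(agreement * 100).toFixed(0)}% agreement`);
  }
}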

Level 3: System Metrics (The Semester Grade). This is where the four dimensions become a composite alignment score. You track how each dimension trends over time, where improvements in one dimension might regress another, and whether the overall quality trajectory matches your product goals.
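
A minimal sketch of the roll-up, reusing the weights from the rubric above; the week-over-week numbers are placeholders, not real data:

level3-composite.ts
interface DimensionScores {
  relevance: number;
  coherence: number;
  depth: number;
  fit: number;
}

const weights: DimensionScores = { relevance: 0.30, coherence: 0.25, depth: 0.20, fit: 0.25 };

function compositeScore(scores: DimensionScores): number {
  return (Object.keys(weights) as (keyof DimensionScores)[]).reduce(
    (total, dim) => total + weights[dim] * scores[dim],
    0
  );
}

// Track each dimension and the composite per week or per release; a gain in one
// dimension that regresses another shows up immediately instead of hiding in an average.
const thisWeek: DimensionScores = { relevance: 0.84, coherence: 0.91, depth: 0.67, fit: 0.72 };
const lastWeek: DimensionScores = { relevance: 0.86, coherence: 0.90, depth: 0.71, fit: 0.70 };

console.log("composite:", compositeScore(thisWeek).toFixed(2));            // 0.79
console.log("depth delta:", (thisWeek.depth - lastWeek.depth).toFixed(2)); // -0.04, a regression to dig into

The eval-rubric.ts block below shows the same rubric wired into Clarity's scoring call.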


eval-rubric.ts
// Define your rubric before writing a single eval. The rubric comes first, always.
const rubric = {
  relevance: { weight: 0.30, threshold: 0.80 },
  coherence: { weight: 0.25, threshold: 0.85 },
  depth: { weight: 0.20, threshold: 0.70 },
  fit: { weight: 0.25, threshold: 0.75 },
};

// Clarity scores each dimension using the self-model.
// Fit requires knowing the user.
const score = await clarity.evaluate(userId, {
  interaction: sessionId,
  rubric: rubric,
});
// Returns: { overall: 0.79, relevance: 0.84, coherence: 0.91, depth: 0.67, fit: 0.72 }

The Rubric Changes Everything

When you have a written rubric, three things change immediately.

First, product reviews become specific. Instead of debating whether the AI is good enough, your team discusses which dimension is lagging. The PM can say depth dropped 5 points after the last prompt change. The engineer can correlate that with a specific retrieval modification. The conversation moves from opinions to evidence.

Second, prioritization becomes obvious. If fit is your lowest dimension, you invest in personalization. If depth is low, you improve retrieval and prompting. If relevance is low, you fix intent detection. The rubric, not gut instinct, tells you where to spend engineering time.

Third, executive conversations become credible. Executives understand rubrics. They use them for performance reviews, vendor evaluations, and investment decisions. A four-dimension quality breakdown with trend lines is infinitely more credible than a single NPS number.


Approach              | Explains to Board | Finds Root Cause | Tracks Over Time | Guides Priorities
Vibes                 | No                | No               | No               | No
NPS Alone             | Sort of           | No               | Yes              | No
Engagement Metrics    | Yes               | No               | Yes              | Misleading
Four-Dimension Rubric | Yes               | Yes              | Yes              | Yes

Where Self-Models Fit the Rubric

Three of the four dimensions (relevance, coherence, and depth) can be measured with traditional eval techniques. You can write assertions, run human reviews, and track system metrics.

But the fourth dimension, fit, is fundamentally different. You cannot measure whether the output matched the user if you do not know the user. This is where self-models become essential.

A self-model captures the user’s beliefs, goals, expertise level, and context. When you evaluate fit, you are asking: given what we know about this user, did the output serve them? Without a self-model, fit scoring is guesswork. With one, it becomes the most powerful dimension in your rubric.
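
What that looks like in code is necessarily a sketch: the SelfModel shape, the annotated output, and scoreFit below are assumptions for illustration, not the Clarity API.

fit-scoring.ts
interface SelfModel {
  expertise: "novice" | "practitioner" | "expert";
  goals: string[];    // e.g. "reduce churn", "raise a Series A"
  knowns: string[];   // things the user already understands, per the self-model
}

// Hypothetical annotations an upstream judge would attach to a single AI output.
interface OutputAnnotations {
  pitchedAt: SelfModel["expertise"];  // how technical the response reads
  goalsTouched: string[];             // user goals the response advanced
  newClaims: string[];                // claims not already in the user's knowns
}

// Fit is only scoreable because the self-model exists: every check below
// compares the output against something we know about this specific user.
function scoreFit(user: SelfModel, output: OutputAnnotations): number {
  const matchedExpertise = output.pitchedAt === user.expertise;
  const servedAGoal = output.goalsTouched.some((g) => user.goals.includes(g));
  const addedNewInformation = output.newClaims.some((c) => !user.knowns.includes(c));
  const checks = [matchedExpertise, servedAGoal, addedNewInformation];
  return checks.filter(Boolean).length / checks.length;
}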

This is the gap that Hamel’s framework acknowledges but does not fully solve. His eval infrastructure is excellent for measuring output quality. Self-models add the user-understanding layer that makes the rubric personal.

Trade-offs

Building a rubric takes time your team thinks they do not have. The conversation about what good looks like will surface disagreements that have been hiding. This is uncomfortable but necessary: those disagreements were affecting your product whether you acknowledged them or not.

Dimension weights are judgment calls. Two reasonable teams can disagree about whether relevance should be 30% or 40% for their product. The rubric does not eliminate judgment: it channels it to the right questions.

Fit requires infrastructure. You cannot score fit without knowing the user. This means investing in self-model infrastructure or user profiling before you can measure your most important dimension.

Rubrics can become targets. Goodhart’s Law applies. Teams can optimize for rubric scores that do not correlate with real satisfaction if the dimensions are poorly calibrated. Regular recalibration against user outcomes is non-negotiable.
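
One lightweight recalibration check: periodically correlate composite rubric scores with a real outcome such as retention, and investigate when the relationship weakens. A sketch with made-up numbers:

recalibration.ts
// Pearson correlation between rubric scores and an outcome the business cares about.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / n;
  const mx = mean(x), my = mean(y);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

// Composite rubric scores and whether each user was retained (1) or churned (0).
// Hypothetical data for illustration only.
const rubricScores = [0.81, 0.62, 0.90, 0.55, 0.77];
const retained =     [1,    0,    1,    0,    1];

// If this drifts toward zero, the rubric has become a Goodhart target and the
// dimensions need recalibration against what users actually value.
console.log("rubric vs retention:", pearson(rubricScores, retained).toFixed(2));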

What to Do Next

  1. Write the rubric this week. Block 45 minutes with your product and engineering leads. Define what excellent, acceptable, and poor look like across relevance, coherence, depth, and fit. Write it down. Share it with the team. The act of writing will surface disagreements you did not know existed.

  2. Score 20 interactions by hand. Have three team members independently grade 20 real AI interactions against the rubric. Compare their scores. Where they agree, you have a solid dimension definition. Where they disagree, you have found the calibration work that matters most. This maps directly to Hamel Husain’s approach to building domain-expert consensus [3].

  3. Instrument one dimension. Pick the dimension with the most signal, usually relevance, and add automated measurement. Track it weekly. One rubric-informed metric will tell you more about quality than your entire engagement dashboard. See how Clarity makes rubric scoring automatic with self-models.


Mrs. Chen never let us turn in an essay without reading the rubric first. Your AI product deserves the same discipline. Write the rubric.

References

  1. Hamel Husain’s eval framework
  2. Hamel Husain approaches evaluation
  3. Hamel Husain’s approach to building domain-expert consensus
  4. only 1 in 26 unhappy customers actually complains
  5. not a reliable predictor of customer retention
  6. sampling bias, non-response bias, cultural bias, and questionnaire bias
  7. Qualtrics notes in their churn prediction framework
  8. NPS does not correlate with renewal or churn
