
The Grading Rubric for AI Products

Your AI product gets graded every day by users who never hand back the scorecard. Here is how to build the rubric yourself, using the same logic your third-grade teacher used for essay writing.

Robert Ta · CEO & Co-Founder · 8 min read

TL;DR

  • AI products get graded by users every day, but most teams have never written down what an A looks like; they ship on vibes and wonder why retention stalls
  • The grading rubric for AI maps directly to what your third-grade teacher used: relevance, coherence, depth, and fit, applied to every AI interaction
  • Teams that define explicit quality rubrics see measurably better retention because they can finally identify what is broken instead of guessing

A grading rubric for AI products decomposes quality into four measurable dimensions: relevance, coherence, depth, and fit. Without an explicit rubric, most AI teams ship on vibes and cannot explain why retention flatlines after month two. This post covers how to build the four-dimension rubric, score each dimension with concrete methods, and combine them into an alignment score that tracks quality over time.


Why AI Products Need a Rubric

Every other product discipline has quality frameworks. Restaurants have health inspection scores. Cars have safety ratings. Software has SLAs and uptime guarantees. Hotels have star ratings.

AI products have vibes.

When I ask AI product leaders how they measure quality, the most common answers are user satisfaction surveys, manual spot-checks by the team, and engagement metrics like session length. These are not quality frameworks. They are thermometers: they tell you the patient has a fever, but not where the infection is.

A rubric is different from a metric. A metric says the number is 73. A rubric says here is what 73 means, here is what 90 looks like, and here is the specific dimension where you are losing points.

The reason Mrs. Chen’s third-grade essay rubric worked was not that it produced a score. It worked because it decomposed quality into dimensions you could independently improve. If your coherence was strong but your depth was weak, you knew exactly where to focus.

AI products need the same decomposition.

The Four-Dimension AI Rubric

After working with dozens of AI product teams, I have found that AI quality decomposes into four dimensions that map remarkably well to essay grading:

Relevance: Did the AI address what the user actually needed? Not what the user typed, but what they meant. A search engine that returns technically correct results for the wrong interpretation of a query scores low on relevance. An AI assistant that answers the literal question but misses the underlying concern scores low on relevance.

Coherence: Does the AI output hold together? Are the recommendations consistent with each other? Does the conversation flow logically? An AI that suggests aggressive growth tactics in one message and conservative risk management in the next scores low on coherence.

Depth: Did the AI go beyond the surface? Did it provide genuine insight or just restate what the user already knew? An AI that summarizes a document without identifying the key implications scores low on depth. A financial AI that reports the numbers without flagging the anomaly scores low on depth.

Fit: Did the output match this specific user? Not a generic user, but the person asking. An AI that gives a technical explanation to a non-technical executive scores low on fit. An AI that provides beginner-level advice to an expert scores low on fit.

In short:

  • Relevance: Did the AI address what the user actually needed? Not what they typed, but what they meant. Measures intent alignment.
  • Coherence: Does the output hold together? Are recommendations consistent? Does conversation flow logically across interactions?
  • Depth: Did the AI go beyond the surface? Did it provide genuine insight or just restate what the user already knew?
  • Fit: Did the output match this specific user? Their expertise, goals, and context. Requires knowing who the user is.

Vibes-Based Quality

  • It feels pretty good
  • Users seem to like it
  • NPS is 42 which is fine I think
  • The demo went well
  • Nobody has complained recently

Rubric-Based Quality

  • Relevance: 0.84, 12% of responses miss user intent
  • Coherence: 0.91, consistent across multi-turn sessions
  • Depth: 0.67, surfacing insights only 67% of the time
  • Fit: 0.59, personalization is our biggest gap

How to Score Each Dimension

Scoring AI quality does not require a PhD in machine learning. It requires the same thing Mrs. Chen needed: a clear description of what each level looks like.

Relevance scoring starts with intent matching. For every AI interaction, you need to know what the user wanted and whether the AI addressed it. The simplest version: after each interaction, did the user get what they came for? You can measure this with explicit feedback (thumbs up/down), implicit signals (did they rephrase the same question), or outcome tracking (did they complete the task).
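A minimal sketch of what this could look like in code, assuming you already log a thumbs rating, whether the user rephrased the same question, and whether the task was completed. The signal names and weights below are illustrative, not part of any particular SDK:

relevance-score.ts
// Combine explicit feedback, implicit signals, and outcome tracking
// into a 0..1 relevance score. Weights are illustrative; calibrate
// them against interactions your team has scored by hand.
interface RelevanceSignals {
  explicitFeedback?: 'up' | 'down';  // thumbs up / thumbs down, if given
  rephrasedSameQuestion: boolean;    // implicit signal that intent was missed
  taskCompleted?: boolean;           // outcome tracking, when observable
}

function scoreRelevance(s: RelevanceSignals): number {
  let score = 0.5;                                 // neutral prior with no signal
  if (s.explicitFeedback === 'up') score += 0.3;
  if (s.explicitFeedback === 'down') score -= 0.3;
  if (s.rephrasedSameQuestion) score -= 0.25;      // rephrasing suggests a miss
  if (s.taskCompleted === true) score += 0.2;
  if (s.taskCompleted === false) score -= 0.2;
  return Math.min(1, Math.max(0, score));
}

// No thumbs, user rephrased once, task eventually completed
console.log(scoreRelevance({ rephrasedSameQuestion: true, taskCompleted: true })); // 0.45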

Coherence scoring looks at consistency across interactions. Does the AI contradict itself? Does it maintain context? Does its personality and approach stay stable? You can measure this by comparing recommendations across sessions, checking for contradictions in multi-turn conversations, and tracking whether users report confusion.
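One way to approximate this is to look for contradictions among the recommendations a session produces. In the sketch below, the topic and stance fields and the string comparison stand in for whatever contradiction detector you actually use; in practice that would be a classifier or an LLM judge:

coherence-score.ts
interface Recommendation {
  topic: string;   // e.g. 'growth strategy'
  stance: string;  // e.g. 'aggressive' or 'conservative'
}

// Share of same-topic recommendation pairs that do NOT contradict each other.
function scoreCoherence(recs: Recommendation[]): number {
  let pairs = 0;
  let contradictions = 0;
  for (let i = 0; i < recs.length; i++) {
    for (let j = i + 1; j < recs.length; j++) {
      if (recs[i].topic !== recs[j].topic) continue;
      pairs++;
      if (recs[i].stance !== recs[j].stance) contradictions++;
    }
  }
  return pairs === 0 ? 1 : 1 - contradictions / pairs;  // nothing to contradict = coherent
}

// The article's example: aggressive growth in one turn, conservative risk in the next
console.log(scoreCoherence([
  { topic: 'growth strategy', stance: 'aggressive' },
  { topic: 'growth strategy', stance: 'conservative' },
])); // 0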

Depth scoring measures insight density. Is the AI telling users something they did not already know? The simplest proxy: how often does the user follow up with a deeper question versus rephrasing the same question? Follow-up questions signal that the AI opened a new line of thinking. Rephrasings signal it missed the point.
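The follow-up-versus-rephrase proxy is easy to turn into a number once each user turn is tagged; the tagging itself (usually a small classifier) is assumed here:

depth-score.ts
type TurnKind = 'follow_up' | 'rephrase' | 'other';

// Depth proxy: among turns that react to an AI answer, the share that
// go deeper rather than ask the same thing again.
function scoreDepth(turns: TurnKind[]): number {
  const followUps = turns.filter((t) => t === 'follow_up').length;
  const rephrases = turns.filter((t) => t === 'rephrase').length;
  const reactions = followUps + rephrases;
  return reactions === 0 ? 0.5 : followUps / reactions;  // 0.5 = no signal either way
}

console.log(scoreDepth(['follow_up', 'rephrase', 'follow_up', 'other'])); // ≈ 0.67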

Fit scoring measures personalization quality. Is the output appropriate for this specific user’s context, expertise level, and goals? This is the hardest dimension to measure without a self-model, because you need to know who the user is before you can evaluate whether the output fits them.
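A minimal sketch of fit scoring, assuming your self-model exposes an expertise level and a preferred format, and that each response is tagged with the level it was pitched at. Both the profile shape and the matching rule are assumptions for illustration:

fit-score.ts
type Expertise = 'beginner' | 'intermediate' | 'expert';
const levels: Expertise[] = ['beginner', 'intermediate', 'expert'];

interface UserProfile  { expertise: Expertise; preferredFormat: 'summary' | 'detail'; }
interface OutputTraits { writtenFor: Expertise; format: 'summary' | 'detail'; }

// Fit = how closely the output's pitch matches the person actually asking.
function scoreFit(user: UserProfile, output: OutputTraits): number {
  const gap = Math.abs(levels.indexOf(user.expertise) - levels.indexOf(output.writtenFor));
  const expertiseFit = 1 - gap / (levels.length - 1);   // 1.0 exact, 0.5 one level off, 0 two off
  const formatFit = user.preferredFormat === output.format ? 1 : 0.5;
  return (expertiseFit + formatFit) / 2;
}

// Beginner-level advice delivered to an expert scores low, as in the article
console.log(scoreFit(
  { expertise: 'expert', preferredFormat: 'detail' },
  { writtenFor: 'beginner', format: 'detail' },
)); // 0.5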

To recap:

  • Score relevance: Start with intent matching. Did the user get what they came for? Measure with explicit feedback, implicit signals (rephrasings), or outcome tracking.
  • Score coherence: Check consistency across interactions. Does the AI contradict itself? Compare recommendations across sessions and track user confusion.
  • Score depth: Measure insight density. Follow-up questions signal the AI opened new thinking. Rephrasings signal it missed the point.
  • Score fit: Measure personalization quality. Requires a self-model to know who the user is before evaluating whether the output matches them.

rubric-score.ts
// Score an AI interaction against the rubric: four dimensions, one score
const rubric = await clarity.evaluateInteraction(userId, {
  interaction: sessionId,
  dimensions: ['relevance', 'coherence', 'depth', 'fit']
});

// Returns quality you can actually discuss:
// {
//   overall: 0.76,
//   relevance: 0.84,  // addressed user intent
//   coherence: 0.91,  // consistent and logical
//   depth: 0.67,      // insight beyond the obvious
//   fit: 0.59,        // matched to this specific user
//   weakest: 'fit',   // focus improvement here
// }

The Rubric in Practice

Here is what changes when you adopt a rubric.

Product reviews become specific. Instead of debating whether the AI is good enough, you discuss which dimension is lagging. The conversation shifts from opinions to evidence. Your PM can say depth dropped from 0.72 to 0.65 after the last release instead of saying users seem less happy.

Prioritization becomes clearer. If fit is your lowest dimension, you invest in personalization infrastructure. If depth is low, you improve your prompts and retrieval pipeline. If relevance is low, you fix intent detection. The rubric tells you where to spend your engineering time.

Board conversations become credible. Executives understand rubrics. They use them for performance reviews, vendor evaluations, and investment decisions. Showing a four-dimension quality breakdown with trendlines is infinitely more credible than showing an NPS score.

In short:

  • Product reviews: Shift from “the AI feels off” to “depth dropped from 0.72 to 0.65 after the last release.” Opinions become evidence.
  • Prioritization: The lowest dimension tells you where to invest. Low fit = personalization. Low depth = prompts and retrieval. Low relevance = intent detection.
  • Board credibility: A four-dimension quality breakdown with trend lines is infinitely more credible than a single NPS number.

| Quality Approach | Can Explain to Board | Identifies Root Cause | Tracks Over Time | Guides Prioritization |
| --- | --- | --- | --- | --- |
| Vibes | No | No | No | No |
| NPS alone | Sort of | No | Yes | No |
| Engagement metrics | Yes | No | Yes | Misleading |
| Four-dimension rubric | Yes | Yes | Yes | Yes |

From Rubric to Alignment Score

The rubric becomes truly powerful when you combine the four dimensions into a single alignment score: a number that represents how well your AI served this user in this interaction.

The alignment score is not a simple average. Relevance matters more than depth for task-oriented interactions. Fit matters more than coherence for personalization-heavy products. The weights reflect your product’s priorities.

But the composite score gives you something invaluable: a single number that trends over time, that you can set targets against, and that everyone from the engineer to the CEO understands. It is the GPA of your AI product.
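A sketch of the weighted composite, reusing the dimension scores from the earlier example. The weights here are deliberately arbitrary, since choosing them is the judgment call discussed under Trade-offs:

alignment-score.ts
interface RubricScores { relevance: number; coherence: number; depth: number; fit: number; }
type Weights = RubricScores;  // same shape; values should sum to 1

// Weighted composite: the GPA of a single interaction.
function alignmentScore(scores: RubricScores, w: Weights): number {
  return (
    scores.relevance * w.relevance +
    scores.coherence * w.coherence +
    scores.depth * w.depth +
    scores.fit * w.fit
  );
}

// Dimension scores from the example above; relevance weighted up for a task-oriented product
const scores  = { relevance: 0.84, coherence: 0.91, depth: 0.67, fit: 0.59 };
const weights = { relevance: 0.4,  coherence: 0.2,  depth: 0.2,  fit: 0.2 };
console.log(alignmentScore(scores, weights).toFixed(2)); // "0.77"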

And just like GPA, it is useful precisely because it decomposes. A 3.2 GPA tells you the student is above average. But the transcript (the rubric) tells you they are an A student in relevance and a C student in fit. That is where the work happens.

Trade-offs

Rubrics require infrastructure to measure. You cannot score fit without knowing who the user is. You cannot score relevance without knowing what the user wanted. This means investing in self-model infrastructure before you can fully implement the rubric.

Dimension weights are subjective. Two reasonable people can disagree about whether relevance should be weighted 30% or 40% for your specific product. The rubric does not eliminate judgment: it focuses it on the right questions.

Not every interaction is gradable. Exploratory conversations, creative brainstorming, and open-ended discovery do not have a clear right answer to grade against. The rubric works best for task-oriented and advisory AI interactions.

The rubric can become the goal. Goodhart’s Law applies: when a measure becomes a target, it ceases to be a good measure. Teams can optimize for rubric scores that do not correlate with user satisfaction if the dimensions are poorly calibrated. Regular recalibration against real user outcomes is essential.

What to Do Next

  1. Write down your rubric. Take 30 minutes and define what good, acceptable, and poor look like across relevance, coherence, depth, and fit for your AI product. Share it with your team. The act of writing it down will surface disagreements you did not know existed.

  2. Score 20 interactions manually. Before automating anything, have three team members independently score 20 real AI interactions using the rubric. Compare scores (a quick way to do this is sketched after this list). Where they agree, you have a clear dimension. Where they disagree, you have a calibration opportunity.

  3. Instrument one dimension. Pick the dimension where you have the most signal, usually relevance, and add automated measurement. Track it weekly. This single metric will reveal more about your AI quality than your entire engagement dashboard. See how Clarity makes rubric scoring automatic.
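For step 2, the comparison can be as simple as checking how often all three reviewers land within a tolerance of each other on a dimension. A sketch, with the tolerance value as an assumption you should tune:

calibration-check.ts
// Each reviewer scores the same interactions (0..1) on one dimension, in the same order.
type ReviewerScores = number[];

// Agreement rate: share of interactions where all reviewers fall within `tolerance`.
// High agreement = the dimension is clearly defined; low agreement = calibration opportunity.
function agreementRate(reviewers: ReviewerScores[], tolerance = 0.15): number {
  const n = reviewers[0].length;
  let agreed = 0;
  for (let i = 0; i < n; i++) {
    const values = reviewers.map((r) => r[i]);
    if (Math.max(...values) - Math.min(...values) <= tolerance) agreed++;
  }
  return agreed / n;
}

// Three reviewers, five interactions, scoring 'depth'
console.log(agreementRate([
  [0.7, 0.6, 0.9, 0.4, 0.8],
  [0.6, 0.6, 0.5, 0.5, 0.8],
  [0.7, 0.7, 0.6, 0.4, 0.7],
])); // 0.8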


Every teacher knows: if you do not define what good looks like before grading, every paper gets a B-minus. Your AI product is getting a B-minus right now. Build the rubric.

