AI Evals Explained: A Product Manager's Guide
AI evals are not as complicated as engineers make them sound. If you can understand restaurant health inspections, driving tests, and essay rubrics, you can understand AI evals.
TL;DR
- AI evals measure how good your AI is: like a restaurant health score measures food safety or a driving test measures road readiness, but the jargon makes them feel inaccessible to non-engineers
- The three types of evals map to familiar concepts: accuracy evals are like driving tests, quality evals are like essay rubrics, and alignment evals are like performance reviews
- Product managers who understand evals make better product decisions because they can prioritize improvements that matter to users, not just improvements that move benchmark numbers
AI evals are systematic measurements of how well an AI product performs its job, analogous to restaurant health inspections, driving tests, and essay rubrics that product managers already understand. The jargon (BLEU scores, perplexity, F1 metrics) creates a false barrier that prevents PMs from making informed quality decisions. This post covers the three types of evals every PM should know, how to translate engineering metrics into product language, and how to use evals for roadmap prioritization.
Type 1: Accuracy Evals (The Driving Test)
A driving test checks whether you can do the basics. Can you parallel park? Do you stop at red lights? Can you merge safely? It is pass/fail on specific tasks.
Accuracy evals work the same way. You give the AI a set of questions with known correct answers and measure how often it gets them right. Can the AI correctly categorize this support ticket? Does it extract the right data from this document? Does it answer this factual question correctly?
What PMs need to know: Accuracy evals tell you whether the AI can do the job at all. They are the minimum bar. If accuracy is low, nothing else matters, like a driver who cannot stop at red lights.
What accuracy does NOT tell you: A driver who passes the driving test can be a terrible driver in real life. They might be slow, aggressive, or confusing to other drivers. Similarly, an AI with high accuracy might produce correct but useless responses, technically right but not helpful.
The PM question to ask: When engineers report accuracy improvements, ask: accuracy on what? The test set matters. A driving test on an empty parking lot tells you less than a driving test in traffic. An accuracy eval on clean, simple inputs tells you less than an eval on messy, real-world inputs.
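To make this concrete, an accuracy eval is just exact-match scoring over a labeled test set. The sketch below is a minimal illustration, not a real API: the `TestCase` shape and the toy classifier are invented for the example.

```typescript
// Minimal accuracy eval: fraction of known-answer cases the AI gets right.
// TestCase and the toy classifier are hypothetical stand-ins for your system.
interface TestCase {
  input: string;
  expected: string; // known correct answer (e.g., a ticket category)
}

function evalAccuracy(
  cases: TestCase[],
  classify: (input: string) => string
): number {
  const correct = cases.filter((c) => classify(c.input) === c.expected).length;
  return correct / cases.length;
}

// Example: a toy classifier scored on a 4-case test set.
const cases: TestCase[] = [
  { input: "My card was charged twice", expected: "billing" },
  { input: "App crashes on launch", expected: "bug" },
  { input: "How do I export my data?", expected: "how-to" },
  { input: "Refund my last invoice", expected: "billing" },
];
const toyClassifier = (input: string) =>
  /charge|refund|invoice/i.test(input) ? "billing" : "bug";

const accuracy = evalAccuracy(cases, toyClassifier);
// 3 of 4 correct: "How do I export my data?" is misclassified as "bug"
```

The same skeleton works whether `classify` is a regex, a fine-tuned model, or an LLM call; the eval only cares whether the output matches the known answer.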
Type 2: Quality Evals (The Essay Rubric)
An essay rubric goes beyond right or wrong. It measures multiple dimensions: relevance, coherence, depth, and voice. A factually correct essay can still be poorly written. A well-written essay can miss the point.
Quality evals measure the same dimensions for AI. Not just whether the answer is correct, but whether it is well-structured, complete, appropriately detailed, and actually useful to the person who asked.
What PMs need to know: Quality evals are where product differentiation lives. Two AI products can have identical accuracy, but one feels dramatically better because it is more relevant, more coherent, deeper, and a better fit for the user.
The rubric dimensions for AI:
- Relevance: Did the AI address what the user actually meant? (Like answering the essay prompt, not a related but different question)
- Coherence: Does the response hold together logically? (Like an essay that flows from point to point)
- Depth: Did the AI go beyond the obvious? (Like an essay that adds insight, not just restates the prompt)
- Fit: Did the response match this specific user? (Like an essay written for the right audience)
The PM question to ask: When the team says quality is improving, ask: which dimension? Improving from 0.65 to 0.72 overall is vague. Improving coherence from 0.65 to 0.72 while fit dropped from 0.58 to 0.52 tells you where to invest.
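One way to keep the per-dimension framing honest is to average each rubric dimension separately instead of collapsing everything into one overall number. The sketch below assumes per-interaction scores in [0, 1] have already been produced (by human raters or an LLM judge); the type and function names are illustrative, not a real API.

```typescript
// Per-dimension quality scores for one AI interaction, each in [0, 1].
// Scores would come from human raters or an LLM judge; names are illustrative.
interface QualityScores {
  relevance: number;
  coherence: number;
  depth: number;
  fit: number;
}

// Average each dimension separately so weaknesses stay visible,
// instead of reporting a single blended quality number.
function averagePerDimension(scored: QualityScores[]): QualityScores {
  const avg = (key: keyof QualityScores) =>
    scored.reduce((acc, s) => acc + s[key], 0) / scored.length;
  return {
    relevance: avg("relevance"),
    coherence: avg("coherence"),
    depth: avg("depth"),
    fit: avg("fit"),
  };
}

const rubric = averagePerDimension([
  { relevance: 0.9, coherence: 0.9, depth: 0.7, fit: 0.6 },
  { relevance: 0.8, coherence: 0.9, depth: 0.6, fit: 0.5 },
]);
// rubric.fit (0.55) is the weakest dimension: invest in personalization first
```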
How Engineers Report Evals
- × F1 score improved 3.2%
- × BLEU score: 0.72
- × Perplexity down to 12.4
- × Latency p95: 340ms
- × PM reaction: nodding while confused
How PMs Should Hear Evals
- ✓ Accuracy improved from 84% to 87% on real user queries
- ✓ Quality rubric: relevance 0.84, coherence 0.91, depth 0.67, fit 0.59
- ✓ Response clarity improved but personalization still lags
- ✓ Users get answers 15% faster
- ✓ PM reaction: knowing exactly where to prioritize
Type 3: Alignment Evals (The Performance Review)
A performance review does not just ask whether this person can do the job. It asks whether this person serves the organization's goals while developing their own capabilities. It measures fit between the individual and the context.
Alignment evals measure the fit between the AI and each specific user. Not just whether the response was good in general, but whether it was good for this person, with their goals, expertise level, and preferences.
What PMs need to know: Alignment evals are the most important and the least common. Most teams measure accuracy and maybe quality. Almost none measure alignment: whether the AI actually served what the user needed.
This matters because two users can ask the same question and need completely different responses. A beginner asking "How do I optimize my database?" needs a different answer than a senior DBA asking the same question. Accuracy says both answers are correct. Quality says both are well-written. Alignment says only one of them is right for the person asking.
The PM question to ask: Are we measuring quality per-user or in aggregate? Aggregate quality can be high while individual alignment is low, because the average hides the variance. The aggregate says 0.75 average quality. The distribution says 30% of users are below 0.50.
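The aggregate-versus-distribution point is easy to verify with a few lines of code. This sketch uses invented per-user scores that match the numbers in the paragraph above:

```typescript
// Per-user alignment scores in [0, 1]; the data here is invented to
// illustrate how an average can hide a large underserved segment.
const perUserAlignment = [0.95, 0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.45, 0.45, 0.4];

const mean =
  perUserAlignment.reduce((a, b) => a + b, 0) / perUserAlignment.length;

// Fraction of users having a poor experience (score below 0.50)
const belowHalf =
  perUserAlignment.filter((s) => s < 0.5).length / perUserAlignment.length;

// mean = 0.75 looks healthy, yet belowHalf = 0.3: 30% of users
// are underserved, which the aggregate number alone never shows
```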
The Three Types Together
Think of it as a progression:
Accuracy is the driver's license: can the AI do the job at all?
Quality is the driving instructor's rubric: how well does the AI do the job across multiple dimensions?
Alignment is the passenger's experience: does this specific ride serve this specific passenger?
You need all three, but most teams stop at accuracy, some add quality, and very few measure alignment. The closer you get to alignment, the closer you get to measuring what users actually experience.
| Eval Type | Analogy | What It Measures | Who Cares |
|---|---|---|---|
| Accuracy | Driving test | Can the AI do the basics correctly | Engineering |
| Quality | Essay rubric | How well the AI performs across dimensions | Product + Engineering |
| Alignment | Performance review | How well the AI serves each specific user | Product + Users + Business |
```typescript
// Three eval types, one API: measure what actually matters
const accuracy = await clarity.evalAccuracy(testSet);
// Result: 0.87, the AI gets it right 87% of the time

const quality = await clarity.evalQuality(interactions);
// Result: { relevance: 0.84, coherence: 0.91, depth: 0.67, fit: 0.59 }

const alignment = await clarity.evalAlignment(userId, interactions);
// Result: 0.72, for THIS user, the AI served their goals 72% of the time

// The PM insight: accuracy is fine, quality is mixed, alignment is the gap.
// Focus investment on fit (0.59) and alignment (0.72),
// not on accuracy (0.87), which is already strong
```
A PM’s Eval Cheat Sheet
Here is how to translate the most common eval terms into concepts you already understand:
Accuracy / F1 score = The driving test pass rate. Higher is better. Above 0.85 is generally good for production.
BLEU score = How closely the AI’s writing matches a reference answer. Like grading an essay by comparing it to a model essay. Useful for translation and summarization, less useful for open-ended tasks.
Perplexity = How surprised the model is by the next word. Lower is better. Like a driving instructor measuring how smooth the driver’s reactions are. Useful for model selection, not very useful for product decisions.
Latency = How fast the AI responds. Like measuring how quickly the restaurant serves your food. Directly impacts user experience.
Alignment score = How well the AI served the user’s actual goals. Like a performance review score. The metric that most directly predicts user satisfaction and retention.
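For the curious, the one formula behind the cheat sheet's first entry: F1 is the harmonic mean of precision (of the items the AI flagged, how many were correct) and recall (of the items it should have flagged, how many it caught). A minimal sketch with invented counts:

```typescript
// F1 = harmonic mean of precision and recall, computed from counts of
// true positives (tp), false positives (fp), and false negatives (fn).
function f1Score(tp: number, fp: number, fn: number): number {
  const precision = tp / (tp + fp); // of flagged items, how many were right
  const recall = tp / (tp + fn); // of true items, how many were caught
  return (2 * precision * recall) / (precision + recall);
}

// Example counts: 80 correct flags, 10 false alarms, 30 misses
const f1 = f1Score(80, 10, 30);
// precision ≈ 0.889, recall ≈ 0.727, F1 = 0.80
```

The harmonic mean punishes imbalance: a model with 0.99 precision but 0.10 recall scores far worse on F1 than one with 0.60 on both.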
How to Use Evals as a PM
In roadmap discussions: Use quality eval dimensions to prioritize features. If fit is your lowest dimension, prioritize personalization. If depth is low, prioritize the knowledge pipeline. Evals turn subjective roadmap debates into evidence-based prioritization.
In engineering reviews: Ask which eval improved and by how much. If the team shipped an improvement, there should be a measurable change in at least one eval dimension. If there is not, the improvement may not be user-visible.
In board presentations: Use the alignment score as your quality headline. Boards understand a 0-to-1 score that trends over time. Show the rubric breakdown (relevance, coherence, depth, fit) as the drill-down when they ask for detail.
In competitive analysis: Compare products on alignment, not accuracy. Most competitors have similar accuracy because they use similar models. Alignment, how well the product serves individual users, is where differentiation happens.
Trade-offs
Evals are snapshots, not movies. A single eval measures quality at one point in time. Quality changes with product updates, user base changes, and model updates. Run evals continuously, not once per quarter.
Good evals require good test data. The eval is only as good as the inputs you test with. If your test set is clean and simple but your real users send messy, ambiguous queries, your eval overestimates quality.
Alignment evals require user context. You cannot measure alignment without knowing what the user wanted. This means you need self-model infrastructure before you can run meaningful alignment evals.
Not everything is evaluable. Creative, exploratory, and open-ended interactions resist clean evaluation. Some AI value is in the exploration itself, not in reaching a correct answer.
What to Do Next
1. Ask your engineering team for the rubric. Next time they report eval results, ask them to translate into the four quality dimensions: relevance, coherence, depth, and fit. If they cannot, the evals are measuring model performance, not product quality.
2. Run a manual alignment eval. Pull 10 real user interactions. For each one, look at the user's query and the AI's response and ask: did this response serve what this specific user needed? Score it 1-5, then average the scores. This is your first alignment eval.
3. Instrument continuous alignment scoring. Move from periodic manual evals to continuous automated measurement. Track alignment per-user and in aggregate, and use the trend to guide product decisions. See how Clarity makes AI evals automatic.
Restaurant health scores. Driving tests. Essay rubrics. You already understand evaluation. Now apply it to AI. Start measuring alignment.