
AI Evals Explained: A Product Manager's Guide

AI evals are not as complicated as engineers make them sound. If you can understand restaurant health inspections, driving tests, and essay rubrics, you can understand AI evals.

Robert Ta, CEO & Co-Founder · 8 min read

TL;DR

  • AI evals measure how good your AI is, the way a restaurant health score measures food safety or a driving test measures road readiness, but the jargon makes them feel inaccessible to non-engineers
  • The three types of evals map to familiar concepts: accuracy evals are like driving tests, quality evals are like essay rubrics, and alignment evals are like performance reviews
  • Product managers who understand evals make better product decisions because they can prioritize improvements that matter to users, not just improvements that move benchmark numbers

AI evals are systematic measurements of how well an AI product performs its job, analogous to restaurant health inspections, driving tests, and essay rubrics that product managers already understand. The jargon (BLEU scores, perplexity, F1 metrics) creates a false barrier that prevents PMs from making informed quality decisions. This post covers the three types of evals every PM should know, how to translate engineering metrics into product language, and how to use evals for roadmap prioritization.


Type 1: Accuracy Evals (The Driving Test)

A driving test checks whether you can do the basics. Can you parallel park? Do you stop at red lights? Can you merge safely? It is pass/fail on specific tasks.

Accuracy evals work the same way. You give the AI a set of questions with known correct answers and measure how often it gets them right. Can the AI correctly categorize this support ticket? Does it extract the right data from this document? Does it answer this factual question correctly?
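
In code, an accuracy eval is little more than a loop over labeled examples. Here is a minimal sketch in TypeScript; `classify` is a hypothetical stand-in for whatever call your product makes to the model, and the test cases are your own labeled data:

```typescript
// Minimal accuracy eval: what fraction of known-answer cases does the AI get right?
type TestCase = { input: string; expected: string };

async function evalAccuracy(
  classify: (input: string) => Promise<string>, // hypothetical AI call
  testSet: TestCase[],
): Promise<number> {
  let correct = 0;
  for (const { input, expected } of testSet) {
    const predicted = await classify(input);
    if (predicted === expected) correct++;
  }
  return correct / testSet.length; // e.g. 0.87: right 87% of the time
}
```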

What PMs need to know: Accuracy evals tell you whether the AI can do the job at all. They are the minimum bar. If accuracy is low, nothing else matters, like a driver who cannot stop at red lights.

What accuracy does NOT tell you: A driver who passes the driving test can be a terrible driver in real life. They might be slow, aggressive, or confusing to other drivers. Similarly, an AI with high accuracy might produce correct but useless responses, technically right but not helpful.

The PM question to ask: When engineers report accuracy improvements, ask: accuracy on what? The test set matters. A driving test on an empty parking lot tells you less than a driving test in traffic. An accuracy eval on clean, simple inputs tells you less than an eval on messy, real-world inputs.


Type 2: Quality Evals (The Essay Rubric)

An essay rubric goes beyond right or wrong. It measures multiple dimensions: relevance, coherence, depth, and voice. A factually correct essay can still be poorly written. A well-written essay can miss the point.

Quality evals measure the same dimensions for AI. Not just whether the answer is correct, but whether it is well-structured, complete, appropriately detailed, and actually useful to the person who asked.

What PMs need to know: Quality evals are where product differentiation lives. Two AI products can have identical accuracy, but one feels dramatically better because it is more relevant, more coherent, deeper, and a better fit for the user.

The rubric dimensions for AI:

  • Relevance: Did the AI address what the user actually meant? (Like answering the essay prompt, not a related but different question)
  • Coherence: Does the response hold together logically? (Like an essay that flows from point to point)
  • Depth: Did the AI go beyond the obvious? (Like an essay that adds insight, not just restates the prompt)
  • Fit: Did the response match this specific user? (Like an essay written for the right audience)

The PM question to ask: When the team says quality is improving, ask: which dimension? Improving from 0.65 to 0.72 overall is vague. Improving coherence from 0.65 to 0.72 while fit dropped from 0.58 to 0.52 tells you where to invest.
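
To make that concrete, here is a sketch of per-dimension tracking, assuming each interaction has already been scored 0-to-1 on the four dimensions (by a human grader or an LLM judge; the type names are illustrative):

```typescript
// Average each rubric dimension separately, so a regression in one
// dimension cannot hide inside a single overall number.
type RubricScore = { relevance: number; coherence: number; depth: number; fit: number };

function averageRubric(scores: RubricScore[]): RubricScore {
  const dims = ["relevance", "coherence", "depth", "fit"] as const;
  const avg = {} as RubricScore;
  for (const d of dims) {
    avg[d] = scores.reduce((sum, s) => sum + s[d], 0) / scores.length;
  }
  return avg;
}
// { relevance: 0.84, coherence: 0.91, depth: 0.67, fit: 0.59 } tells you
// where to invest; a single 0.75 average does not.
```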

How Engineers Report Evals

  • F1 score improved 3.2%
  • BLEU score: 0.72
  • Perplexity down to 12.4
  • Latency p95: 340ms
  • PM reaction: nodding while confused

How PMs Should Hear Evals

  • Accuracy improved from 84% to 87% on real user queries
  • Quality rubric: relevance 0.84, coherence 0.91, depth 0.67, fit 0.59
  • Response clarity improved but personalization still lags
  • Users get answers 15% faster
  • PM reaction: knowing exactly where to prioritize

Type 3: Alignment Evals (The Performance Review)

A performance review does not just ask whether this person can do the job. It asks whether this person serves the organization's goals while developing their own capabilities. It measures fit between the individual and the context.

Alignment evals measure the fit between the AI and each specific user. Not just whether the response was good in general, but whether it was good for this person, with their goals, expertise level, and preferences.

What PMs need to know: Alignment evals are the most important and the least common. Most teams measure accuracy and maybe quality. Almost none measure alignment, whether the AI actually served what the user needed.

This matters because two users can ask the same question and need completely different responses. A beginner asking how do I optimize my database needs a different answer than a senior DBA asking the same question. Accuracy says both answers are correct. Quality says both are well-written. Alignment says only one of them is right for the person asking.

The PM question to ask: Are we measuring quality per-user or in aggregate? Aggregate quality can be high while individual alignment is low, because the average hides the variance. The aggregate says 0.75 average quality. The distribution says 30% of users are below 0.50.
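
A sketch of that difference, assuming you already have one quality score per user (the shape of the data is illustrative):

```typescript
// Aggregate average vs. per-user distribution: the average can look healthy
// while a large slice of users has a bad experience.
function alignmentReport(scoreByUser: Map<string, number>) {
  const scores = [...scoreByUser.values()];
  const average = scores.reduce((a, b) => a + b, 0) / scores.length;
  const shareBelow = scores.filter((s) => s < 0.5).length / scores.length;
  return { average, shareBelow };
}
// { average: 0.75, shareBelow: 0.3 } is exactly the trap described above:
// a fine headline number hiding 30% of users below 0.50.
```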

The Three Types Together

Think of it as a progression:

Accuracy is the driver's license: can the AI do the job at all?

Quality is the driving instructor's rubric: how well does the AI do the job across multiple dimensions?

Alignment is the passenger's experience: does this specific ride serve this specific passenger?

You need all three, but most teams stop at accuracy, some add quality, and very few measure alignment. The closer you get to alignment, the closer you get to measuring what users actually experience.


| Eval Type | Analogy | What It Measures | Who Cares |
| --- | --- | --- | --- |
| Accuracy | Driving test | Can the AI do the basics correctly | Engineering |
| Quality | Essay rubric | How well the AI performs across dimensions | Product + Engineering |
| Alignment | Performance review | How well the AI serves each specific user | Product + Users + Business |
eval-types.ts

```typescript
// Three eval types, one API. Measure what actually matters.
const accuracy = await clarity.evalAccuracy(testSet);
// Result: 0.87; the AI gets it right 87% of the time

const quality = await clarity.evalQuality(interactions);
// Result: { relevance: 0.84, coherence: 0.91, depth: 0.67, fit: 0.59 }

const alignment = await clarity.evalAlignment(userId, interactions);
// Result: 0.72; for THIS user, the AI served their goals 72% of the time

// The PM insight: Accuracy is fine. Quality is mixed. Alignment is the gap.
// Focus investment on fit (0.59) and alignment (0.72),
// not on accuracy (0.87), which is already strong.
```

A PM’s Eval Cheat Sheet

Here is how to translate the most common eval terms into concepts you already understand:

Accuracy / F1 score = The driving test pass rate. Higher is better. Above 0.85 is generally good for production.
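
One nuance worth knowing: F1 is not raw accuracy. It balances precision (of the items the AI flagged, how many were right) against recall (of the items it should have flagged, how many it caught). A minimal sketch:

```typescript
// F1 is the harmonic mean of precision and recall.
// tp, fp, fn = true positives, false positives, false negatives.
function f1Score(tp: number, fp: number, fn: number): number {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return (2 * precision * recall) / (precision + recall);
}
// f1Score(85, 10, 20) ≈ 0.85, roughly the "good for production" bar above.
```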

BLEU score = How closely the AI’s writing matches a reference answer. Like grading an essay by comparing it to a model essay. Useful for translation and summarization, less useful for open-ended tasks.

Perplexity = How surprised the model is by the next word. Lower is better. Like a driving instructor measuring how smooth the driver’s reactions are. Useful for model selection, not very useful for product decisions.

Latency = How fast the AI responds. Like measuring how quickly the restaurant serves your food. Directly impacts user experience.

Alignment score = How well the AI served the user’s actual goals. Like a performance review score. The metric that most directly predicts user satisfaction and retention.


How to Use Evals as a PM

In roadmap discussions: Use quality eval dimensions to prioritize features. If fit is your lowest dimension, prioritize personalization. If depth is low, prioritize the knowledge pipeline. Evals turn subjective roadmap debates into evidence-based prioritization.
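
As a sketch, the prioritization logic is almost mechanical; the dimension-to-theme mapping below is illustrative, not prescriptive:

```typescript
// Turn the weakest rubric dimension into a roadmap theme.
const roadmapTheme: Record<string, string> = {
  fit: "personalization",
  depth: "knowledge pipeline",
  relevance: "query understanding",
  coherence: "response structuring",
};

function nextPriority(rubric: Record<string, number>): string {
  const [weakest] = Object.entries(rubric).sort((a, b) => a[1] - b[1])[0];
  return roadmapTheme[weakest] ?? weakest;
}

// nextPriority({ relevance: 0.84, coherence: 0.91, depth: 0.67, fit: 0.59 })
// → "personalization"
```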

In engineering reviews: Ask which eval improved and by how much. If the team shipped an improvement, there should be a measurable change in at least one eval dimension. If there is not, the improvement may not be user-visible.

In board presentations: Use the alignment score as your quality headline. Boards understand a 0-to-1 score that trends over time. Show the rubric breakdown (relevance, coherence, depth, fit) as the drill-down when they ask for detail.

In competitive analysis: Compare products on alignment, not accuracy. Most competitors have similar accuracy because they use similar models. Alignment, how well the product serves individual users, is where differentiation happens.


Trade-offs

Evals are snapshots, not movies. A single eval measures quality at one point in time. Quality changes with product updates, user base changes, and model updates. Run evals continuously, not once per quarter.

Good evals require good test data. The eval is only as good as the inputs you test with. If your test set is clean and simple but your real users send messy, ambiguous queries, your eval overestimates quality.

Alignment evals require user context. You cannot measure alignment without knowing what the user wanted. This means you need self-model infrastructure before you can run meaningful alignment evals.

Not everything is evaluable. Creative, exploratory, and open-ended interactions resist clean evaluation. Some AI value is in the exploration itself, not in reaching a correct answer.

What to Do Next

  1. Ask your engineering team for the rubric. Next time they report eval results, ask them to translate the numbers into the four quality dimensions: relevance, coherence, depth, and fit. If they cannot, the evals are measuring model performance, not product quality.

  2. Run a manual alignment eval. Pull 10 real user interactions. For each one, look at the user's query and the AI's response, and ask: did this response serve what this specific user needed? Score it 1-5 and average the scores (see the sketch after this list). This is your first alignment eval.

  3. Instrument continuous alignment scoring. Move from periodic manual evals to continuous automated measurement. Track alignment per-user and in aggregate. Use the trend to guide product decisions. See how Clarity makes AI evals automatic.
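
The arithmetic for step 2 fits in a few lines; the scores below are hypothetical hand-assigned values:

```typescript
// Manual alignment eval: average 1-to-5 scores across 10 real interactions,
// and count how many served the user poorly (below 3).
const handScores = [5, 4, 2, 5, 3, 4, 2, 5, 4, 3]; // hypothetical

const average = handScores.reduce((a, b) => a + b, 0) / handScores.length; // 3.7
const poorCount = handScores.filter((s) => s < 3).length; // 2 of 10
console.log({ average, poorCount });
```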


Restaurant health scores. Driving tests. Essay rubrics. You already understand evaluation. Now apply it to AI. Start measuring alignment.

