Why Your Evals Are Lying to You

High eval scores, angry users. The gap between what you measure and what users experience exists because your evals test the AI in isolation, not in the context of the person using it.

Robert Ta · CEO & Co-Founder · 7 min read

TL;DR

  • High eval scores can coexist with angry users when evals measure output correctness in isolation from user context: the AI produces good answers to the wrong question for the wrong person
  • The eval gap exists because most evaluation systems treat every user identically, scoring against generic quality criteria that miss the individual experience
  • Closing the gap requires adding user awareness to your eval system, measuring not just whether the output is correct but whether it served the specific person who received it

AI evals lie when they measure output correctness in isolation from user context, producing high scores while users churn because the AI does not understand them. The same response can score perfectly on accuracy benchmarks while completely failing the specific person who received it. This post covers the three ways evals deceive (vacuum testing, averaging over distributions, and temporal drift), the alignment gap between eval scores and user satisfaction, and how to add user awareness to your evaluation system.


Lie 1: Testing in a Vacuum

Your eval suite tests the AI on inputs in isolation. Here is the input. Here is the expected output. Does it match? Score.

This approach misses the user entirely. The same input from two different users should produce different outputs: different depth, different assumptions, different emphasis. But your eval suite scores both outputs against the same expected answer.

Imagine a car safety rating that tested every car with the same size crash test dummy. The rating would be accurate for people who match the dummy’s dimensions and misleading for everyone else. AI evals have the same problem: they test with a generic user and declare quality for all users.

The fix is not removing generic evals. It is supplementing them with user-contextualized evals. Test the same output against the question: was this good for this specific user?

Generic Eval: “Is This Correct?”

Tests the AI on inputs in isolation. Same expected output for all users. Scores against a universal reference. Accurate for the “average” user, misleading for everyone else.

Contextualized Eval: “Is This Right for This User?”

Tests the AI output against the specific user’s expertise, goals, and context. Two users asking the same question get evaluated against different criteria. Scores correlate with actual satisfaction.

Lie 2: Averaging Hides Distribution

Your quality score is 0.89. That sounds great. But 0.89 is an average across all users. What does the distribution look like?

Maybe 70% of users experience 0.95 quality and 30% experience 0.75 quality. The average is 0.89, technically accurate. But the 30% who get 0.75 are churning, and your average is hiding their experience.

This distribution problem is particularly severe for personalization. Users with common profiles get great experiences because the AI was trained on data that represents them. Users with unusual profiles (different expertise levels, different goals, different communication styles) get generic experiences that the average obscures.

As Hamel Husain argues in his eval framework [1], the way to catch this is to look at the tails, not the middle. The users your AI serves worst are the ones who will leave first, and they are invisible in an average.
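To make the tail problem concrete, here is a minimal TypeScript sketch (the data and the `qualitySummary` helper are illustrative, not a prescribed implementation) that reports percentiles alongside the mean:

```ts
// Mean hides the tail: report percentiles alongside the average.
function qualitySummary(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const pct = (p: number) =>
    sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
  const mean = sorted.reduce((sum, s) => sum + s, 0) / sorted.length;
  return { mean, p10: pct(0.1), p50: pct(0.5), p90: pct(0.9) };
}

// 70 users at 0.95, 30 users at 0.75: the mean says ~0.89, the p10 says 0.75.
const scores = [...Array(70).fill(0.95), ...Array(30).fill(0.75)];
console.log(qualitySummary(scores));
// { mean: ~0.89, p10: 0.75, p50: 0.95, p90: 0.95 }
```

The p10 is the experience of your unhappiest decile, and it is the number the average was hiding.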

Lie 3: Past Performance Predicts Future Quality

Your eval suite was built six months ago. It tests scenarios from six months ago. Your product has evolved. Your users have evolved. Your use cases have expanded. But your eval suite is still grading against the old expectations.

This temporal drift is insidious. The eval scores remain high because the test cases are stale: they test scenarios the AI has been optimized for, not the scenarios users actually encounter today. The result is a steadily increasing gap between eval scores and real-world quality.

The fix is continuous eval maintenance. Every month, replace 20% of your test cases with fresh examples from real user interactions. Prioritize cases where users expressed dissatisfaction; those are the scenarios your eval suite is probably missing.
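A minimal sketch of that monthly rotation; the types and the `refreshSuite` helper are hypothetical, and in practice a reviewer must write corrected expected answers before dissatisfied-user interactions become test cases:

```ts
// Hypothetical shapes for an eval case and a logged user interaction.
interface EvalCase {
  input: string;
  expected: string;
  addedAt: Date;
}

interface UserInteraction {
  input: string;
  output: string;
  userSatisfied: boolean;
}

// Each month: drop the oldest 20% of cases and backfill from real traffic,
// taking dissatisfied interactions first.
function refreshSuite(suite: EvalCase[], recent: UserInteraction[]): EvalCase[] {
  const toReplace = Math.floor(suite.length * 0.2);
  const kept = [...suite]
    .sort((a, b) => b.addedAt.getTime() - a.addedAt.getTime()) // newest first
    .slice(0, suite.length - toReplace);

  const fresh = [...recent]
    .sort((a, b) => Number(a.userSatisfied) - Number(b.userSatisfied)) // unhappy first
    .slice(0, toReplace)
    // Placeholder: a reviewer should replace the raw output with a corrected
    // expected answer before the case enters the suite.
    .map((i) => ({ input: i.input, expected: i.output, addedAt: new Date() }));

  return [...kept, ...fresh];
}
```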

Lie 1: Vacuum Testing

Tests output correctness without knowing who the user is. Like rating car safety with one size of crash test dummy. Accurate for one profile, misleading for all others.

Lie 2: Averaging Over Distribution

Reports 0.89 average while 30% of users experience 0.75 quality. The churning users are invisible in the average. The tails matter more than the middle.

Lie 3: Temporal Drift

Test cases from six months ago test scenarios the AI has been optimized for, not what users encounter today. Scores remain high while real quality degrades.

Evals That Lie

  • Tests output correctness in isolation
  • Reports average quality across all users
  • Uses test cases from six months ago
  • No user context in evaluation criteria
  • Dashboard shows 0.89 while users churn

Evals That Tell the Truth

  • Tests output quality in user context
  • Reports quality distribution, not just averages
  • Refreshes 20% of test cases monthly
  • Self-models provide user context to evaluation
  • Dashboard shows quality per user segment

The Alignment Gap

The gap between eval scores and user satisfaction has a name: the alignment gap. It exists because your evals measure output quality (is this response correct?) while your users experience alignment quality (does this AI understand me?).

Output quality is necessary but not sufficient. A perfectly correct response that is pitched at the wrong level for the user, that ignores their stated goals, or that contradicts advice given in a previous session is a quality response to the wrong person.

Alignment quality asks a different question: given what we know about this user (their expertise, their goals, their communication preferences, their history), how well did the AI serve them? This is a fundamentally different measurement from correctness.

The alignment gap is measurable. Compare your output quality score (what your evals report) with your user satisfaction score (what users report). The difference is the alignment gap. A large gap means your evals are measuring the wrong thing. A small gap means your evals are well-calibrated to user experience.

alignment-aware-eval.ts

```ts
// Close the alignment gap: evaluate with user context.
// Same test, different expectations per user.
const userModel = await clarity.getSelfModel(userId);

const evaluation = await clarity.evaluate({
  response: aiOutput,
  query: userQuery,
  userContext: userModel,
  dimensions: {
    correctness: 'Is the information accurate?',
    relevance: 'Does it address this user\'s actual intent?',
    fit: 'Is the depth/tone/style right for THIS user?',
    coherence: 'Is it consistent with previous interactions?',
  },
});

// Now your eval score correlates with user satisfaction
```

How to Stop the Lying

Step 1: Measure the gap. Take 50 recent interactions where you have both eval scores and user satisfaction signals (thumbs up/down, completion rates, follow-up behavior). Calculate the correlation between your eval scores and user satisfaction. If the correlation is below 0.7, your evals are significantly disconnected from user experience.
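Here is one way to compute that correlation in TypeScript; the `pearson` helper and the sample arrays are illustrative, not a prescribed implementation:

```ts
// Pearson correlation between two equal-length series.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mean = (v: number[]) => v.reduce((s, a) => s + a, 0) / n;
  const mx = mean(x);
  const my = mean(y);
  let cov = 0;
  let vx = 0;
  let vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

// Eval scores and satisfaction signals (thumbs up = 1, down = 0) for the
// same interactions; these values are illustrative.
const evalScores = [0.92, 0.95, 0.88, 0.9, 0.93, 0.7];
const satisfaction = [1, 1, 0, 1, 0, 0];
console.log(pearson(evalScores, satisfaction)); // below 0.7: evals disconnected
```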

Step 2: Segment your quality data. Instead of reporting a single quality number, break it down by user type: new versus returning users, expert versus novice, high-engagement versus low-engagement. The segmented view will reveal which user populations your evals accurately represent and which are invisible.
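A sketch of that segmented view; the segment labels and the `qualityBySegment` helper are illustrative:

```ts
// One quality score per interaction, tagged with a user segment.
interface ScoredInteraction {
  segment: string; // e.g. 'new', 'returning', 'expert', 'novice'
  quality: number; // 0..1
}

// Average quality per segment instead of one global number.
function qualityBySegment(interactions: ScoredInteraction[]): Record<string, number> {
  const acc: Record<string, { total: number; count: number }> = {};
  for (const i of interactions) {
    const s = (acc[i.segment] ??= { total: 0, count: 0 });
    s.total += i.quality;
    s.count += 1;
  }
  return Object.fromEntries(
    Object.entries(acc).map(([seg, { total, count }]) => [seg, total / count]),
  );
}
// e.g. { returning: 0.93, new: 0.74 }: the low segment is where churn starts
```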

Step 3: Add user context to a subset of evals. You do not need to overhaul your entire eval system. Start by adding user context (expertise level, goals, preferences) to 20% of your eval cases. Compare the contextualized scores to the generic scores. The gap reveals how much personalization quality your current evals miss.

Step 4: Track the alignment gap over time. Make the gap between eval scores and user satisfaction a first-class metric. Report it alongside quality scores. When the gap widens, your evals need recalibration. When it narrows, your quality measurement is improving.
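One lightweight way to make the gap a first-class metric is to snapshot it each eval cycle; this sketch and its `GapSnapshot` shape are illustrative assumptions, not a prescribed schema:

```ts
// A per-cycle record of the gap between measured and experienced quality.
interface GapSnapshot {
  date: string;         // ISO date of the eval cycle
  evalScore: number;    // what your evals report
  satisfaction: number; // what users report, normalized to 0..1
  gap: number;          // evalScore - satisfaction
}

function snapshotGap(evalScore: number, satisfaction: number): GapSnapshot {
  return {
    date: new Date().toISOString().slice(0, 10),
    evalScore,
    satisfaction,
    gap: evalScore - satisfaction,
  };
}

// Append each cycle's snapshot to your metrics store; a widening gap over
// successive snapshots means your evals need recalibration.
```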

Step 1: Measure the Gap

Pull 50 interactions with both eval scores and user satisfaction signals. Calculate the correlation. Below 0.7 means your evals are significantly disconnected from user experience.

Step 2: Segment Your Quality Data

Break down by user type: new vs returning, expert vs novice, high-engagement vs low-engagement. The segmented view reveals which populations your evals accurately represent.

Step 3: Add User Context to a Subset

Add expertise level, goals, and preferences to 20% of eval cases. Compare contextualized scores to generic scores. The gap reveals how much personalization quality your current evals miss.

Step 4: Track the Alignment Gap Over Time

Make the gap between eval scores and user satisfaction a first-class metric. When the gap widens, recalibrate evals. When it narrows, your quality measurement is improving.

| What Evals Measure | What Users Experience | The Gap |
| --- | --- | --- |
| Output correctness | Was it useful to me? | Intent alignment |
| Average quality | My personal quality | Distribution blindness |
| Performance on test set | Performance on my use cases | Temporal drift |
| Generic quality criteria | Quality for my context | Personalization blindness |

Trade-offs

Adding user context to evals increases complexity. Each eval case now requires user context alongside the input and expected output. This is more work to maintain, but it is the only way to measure what users actually experience.

Distribution analysis requires more data. You cannot segment quality by user type with 50 test cases. You need volume to see meaningful distributions. This is where LLM-as-judge or automated scoring becomes essential.

Continuous eval maintenance is ongoing work. Refreshing 20% of test cases monthly requires discipline. It is tempting to let the eval suite stagnate. Budget the time as non-negotiable quality infrastructure.

The alignment gap is uncomfortable. When you first measure it, the gap will be larger than you expect. This can feel like a failure. It is not; it is visibility. You cannot close a gap you cannot see.

What to Do Next

  1. Calculate your alignment gap this week. Pull 50 interactions with both eval scores and user satisfaction signals. Calculate the correlation. If it is below 0.7, your evals are lying to you about user experience. This single number will change how you think about quality.

  2. Segment your quality data. Break your quality scores by user type. Find the populations where quality is lowest. These are the users churning, and your average is hiding them.

  3. Add user context to your next eval cycle. For your next round of human review or LLM-as-judge evaluation, include user context, even basic attributes like expertise level and primary goal. Clarity provides the self-model API that makes user-aware evaluation possible. Watch how the scores change when the evaluator knows who the output was for.


Your evals say quality is 0.89. Your users say otherwise. Believe the users. Close the alignment gap.

References

  1. Hamel Husain argues in his eval framework
  2. only 1 in 26 unhappy customers actually complains
  3. not a reliable predictor of customer retention
  4. sampling bias, non-response bias, cultural bias, and questionnaire bias
  5. Qualtrics notes in their churn prediction framework
  6. NPS does not correlate with renewal or churn
