Why Your Evals Are Lying to You

High eval scores, angry users. The gap between what you measure and what users experience exists because your evals test the AI in isolation, not in the context of the person using it.

Robert Ta · CEO & Co-Founder · 7 min read

TL;DR

  • High eval scores can coexist with angry users when evals measure output correctness in isolation from user context: the AI produces good answers to the wrong question for the wrong person
  • The eval gap exists because most evaluation systems treat every user identically, scoring against generic quality criteria that miss the individual experience
  • Closing the gap requires adding user awareness to your eval system, measuring not just whether the output is correct but whether it served the specific person who received it

AI evals lie when they measure output correctness in isolation from user context, producing high scores while users churn because the AI does not understand them. The same response can score perfectly on accuracy benchmarks while completely failing the specific person who received it. This post covers the three ways evals deceive (vacuum testing, averaging over distributions, and temporal drift), the alignment gap between eval scores and user satisfaction, and how to add user awareness to your evaluation system.


Lie 1: Testing in a Vacuum

Your eval suite tests the AI on inputs in isolation. Here is the input. Here is the expected output. Does it match? Score.

This approach misses the user entirely. The same input from two different users should produce different outputs: different depth, different assumptions, different emphasis. But your eval suite scores both outputs against the same expected answer.

Imagine a car safety rating that tested every car with the same size crash test dummy. The rating would be accurate for people who match the dummy’s dimensions and misleading for everyone else. AI evals have the same problem: they test with a generic user and declare quality for all users.

The fix is not removing generic evals. It is supplementing them with user-contextualized evals. Test the same output against the question: was this good for this specific user?

Generic Eval: “Is This Correct?”

Tests the AI on inputs in isolation. Same expected output for all users. Scores against a universal reference. Accurate for the “average” user, misleading for everyone else.

Contextualized Eval: “Is This Right for This User?”

Tests the AI output against the specific user’s expertise, goals, and context. Two users asking the same question get evaluated against different criteria. Scores correlate with actual satisfaction.

Lie 2: Averaging Hides Distribution

Your quality score is 0.89. That sounds great. But 0.89 is an average across all users. What does the distribution look like?

Maybe 70% of users experience 0.95 quality and 30% experience 0.75 quality. The average is 0.89, technically accurate. But the 30% who get 0.75 are churning, and your average is hiding their experience.

This distribution problem is particularly severe for personalization. Users with common profiles get great experiences because the AI was trained on data that represents them. Users with unusual profiles (different expertise levels, different goals, different communication styles) get generic experiences that the average obscures.

As Hamel Husain argues in his eval framework [1], the way to catch this is to look at the tails, not the middle. The users your AI serves worst are the ones who will leave first, and they are invisible in an average.
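To make the tail problem concrete, here is a minimal TypeScript sketch (the data and the `qualitySummary` helper are illustrative, not a prescribed implementation) that reports percentiles alongside the mean:

```ts
// Mean hides the tail: report percentiles alongside the average.
function qualitySummary(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const pct = (p: number) =>
    sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
  const mean = sorted.reduce((sum, s) => sum + s, 0) / sorted.length;
  return { mean, p10: pct(0.1), p50: pct(0.5), p90: pct(0.9) };
}

// 70 users at 0.95, 30 users at 0.75: the mean says ~0.89, the p10 says 0.75.
const scores = [...Array(70).fill(0.95), ...Array(30).fill(0.75)];
console.log(qualitySummary(scores));
// { mean: ~0.89, p10: 0.75, p50: 0.95, p90: 0.95 }
```

The p10 is the experience of your unhappiest decile, and it is the number the average was hiding.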

Lie 3: Past Performance Predicts Future Quality

Your eval suite was built six months ago. It tests scenarios from six months ago. Your product has evolved. Your users have evolved. Your use cases have expanded. But your eval suite is still grading against the old expectations.

This temporal drift is insidious. The eval scores remain high because the test cases are stale: they test scenarios the AI has been optimized for, not the scenarios users actually encounter today. The result is a steadily increasing gap between eval scores and real-world quality.

The fix is continuous eval maintenance. Every month, replace 20% of your test cases with fresh examples from real user interactions. Prioritize cases where users expressed dissatisfaction; those are the scenarios your eval suite is probably missing.
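A minimal sketch of that monthly rotation; the types and the `refreshSuite` helper are hypothetical, and in practice a reviewer must write corrected expected answers before dissatisfied-user interactions become test cases:

```ts
// Hypothetical shapes for an eval case and a logged user interaction.
interface EvalCase {
  input: string;
  expected: string;
  addedAt: Date;
}

interface UserInteraction {
  input: string;
  output: string;
  userSatisfied: boolean;
}

// Each month: drop the oldest 20% of cases and backfill from real traffic,
// taking dissatisfied interactions first.
function refreshSuite(suite: EvalCase[], recent: UserInteraction[]): EvalCase[] {
  const toReplace = Math.floor(suite.length * 0.2);
  const kept = [...suite]
    .sort((a, b) => b.addedAt.getTime() - a.addedAt.getTime()) // newest first
    .slice(0, suite.length - toReplace);

  const fresh = [...recent]
    .sort((a, b) => Number(a.userSatisfied) - Number(b.userSatisfied)) // unhappy first
    .slice(0, toReplace)
    // Placeholder: a reviewer should replace the raw output with a corrected
    // expected answer before the case enters the suite.
    .map((i) => ({ input: i.input, expected: i.output, addedAt: new Date() }));

  return [...kept, ...fresh];
}
```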

Lie 1: Vacuum Testing

Tests output correctness without knowing who the user is. Like rating car safety with one size of crash test dummy. Accurate for one profile, misleading for all others.

Lie 2: Averaging Over Distribution

Reports 0.89 average while 30% of users experience 0.75 quality. The churning users are invisible in the average. The tails matter more than the middle.

Lie 3: Temporal Drift

Test cases from six months ago test scenarios the AI has been optimized for, not what users encounter today. Scores remain high while real quality degrades.

Evals That Lie

  • Tests output correctness in isolation
  • Reports average quality across all users
  • Uses test cases from six months ago
  • No user context in evaluation criteria
  • Dashboard shows 0.89 while users churn

Evals That Tell the Truth

  • Tests output quality in user context
  • Reports quality distribution, not just averages
  • Refreshes 20% of test cases monthly
  • Self-models provide user context to evaluation
  • Dashboard shows quality per user segment

The Alignment Gap

The gap between eval scores and user satisfaction has a name: the alignment gap. It exists because your evals measure output quality (is this response correct?) while your users experience alignment quality (does this AI understand me?).

Output quality is necessary but not sufficient. A perfectly correct response that is pitched at the wrong level for the user, that ignores their stated goals, or that contradicts advice given in a previous session is a quality response to the wrong person.

Alignment quality asks a different question: given what we know about this user (their expertise, their goals, their communication preferences, their history), how well did the AI serve them? This is a fundamentally different measurement from correctness.

The alignment gap is measurable. Compare your output quality score (what your evals report) with your user satisfaction score (what users report). The difference is the alignment gap. A large gap means your evals are measuring the wrong thing. A small gap means your evals are well-calibrated to user experience.

alignment-aware-eval.ts

```ts
// Close the alignment gap: evaluate with user context.
// Same test, different expectations per user.
const userModel = await clarity.getSelfModel(userId);

const evaluation = await clarity.evaluate({
  response: aiOutput,
  query: userQuery,
  userContext: userModel,
  dimensions: {
    correctness: 'Is the information accurate?',
    relevance: 'Does it address this user\'s actual intent?',
    fit: 'Is the depth/tone/style right for THIS user?',
    coherence: 'Is it consistent with previous interactions?',
  },
});

// Now your eval score correlates with user satisfaction
```

How to Stop the Lying

Step 1: Measure the gap. Take 50 recent interactions where you have both eval scores and user satisfaction signals (thumbs up/down, completion rates, follow-up behavior). Calculate the correlation between your eval scores and user satisfaction. If the correlation is below 0.7, your evals are significantly disconnected from user experience.
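Here is one way to compute that correlation in TypeScript; the `pearson` helper and the sample arrays are illustrative, not a prescribed implementation:

```ts
// Pearson correlation between two equal-length series.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mean = (v: number[]) => v.reduce((s, a) => s + a, 0) / n;
  const mx = mean(x);
  const my = mean(y);
  let cov = 0;
  let vx = 0;
  let vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

// Eval scores and satisfaction signals (thumbs up = 1, down = 0) for the
// same interactions; these values are illustrative.
const evalScores = [0.92, 0.95, 0.88, 0.9, 0.93, 0.7];
const satisfaction = [1, 1, 0, 1, 0, 0];
console.log(pearson(evalScores, satisfaction)); // below 0.7: evals disconnected
```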

Step 2: Segment your quality data. Instead of reporting a single quality number, break it down by user type: new versus returning users, expert versus novice, high-engagement versus low-engagement. The segmented view will reveal which user populations your evals accurately represent and which are invisible.
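A sketch of that segmented view; the segment labels and the `qualityBySegment` helper are illustrative:

```ts
// One quality score per interaction, tagged with a user segment.
interface ScoredInteraction {
  segment: string; // e.g. 'new', 'returning', 'expert', 'novice'
  quality: number; // 0..1
}

// Average quality per segment instead of one global number.
function qualityBySegment(interactions: ScoredInteraction[]): Record<string, number> {
  const acc: Record<string, { total: number; count: number }> = {};
  for (const i of interactions) {
    const s = (acc[i.segment] ??= { total: 0, count: 0 });
    s.total += i.quality;
    s.count += 1;
  }
  return Object.fromEntries(
    Object.entries(acc).map(([seg, { total, count }]) => [seg, total / count]),
  );
}
// e.g. { returning: 0.93, new: 0.74 }: the low segment is where churn starts
```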

Step 3: Add user context to a subset of evals. You do not need to overhaul your entire eval system. Start by adding user context (expertise level, goals, preferences) to 20% of your eval cases. Compare the contextualized scores to the generic scores. The gap reveals how much personalization quality your current evals miss.

Step 4: Track the alignment gap over time. Make the gap between eval scores and user satisfaction a first-class metric. Report it alongside quality scores. When the gap widens, your evals need recalibration. When it narrows, your quality measurement is improving.
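One lightweight way to make the gap a first-class metric is to snapshot it each eval cycle; this sketch and its `GapSnapshot` shape are illustrative assumptions, not a prescribed schema:

```ts
// A per-cycle record of the gap between measured and experienced quality.
interface GapSnapshot {
  date: string;         // ISO date of the eval cycle
  evalScore: number;    // what your evals report
  satisfaction: number; // what users report, normalized to 0..1
  gap: number;          // evalScore - satisfaction
}

function snapshotGap(evalScore: number, satisfaction: number): GapSnapshot {
  return {
    date: new Date().toISOString().slice(0, 10),
    evalScore,
    satisfaction,
    gap: evalScore - satisfaction,
  };
}

// Append each cycle's snapshot to your metrics store; a widening gap over
// successive snapshots means your evals need recalibration.
```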

Step 1: Measure the Gap

Pull 50 interactions with both eval scores and user satisfaction signals. Calculate the correlation. Below 0.7 means your evals are significantly disconnected from user experience.

Step 2: Segment Your Quality Data

Break down by user type: new vs returning, expert vs novice, high-engagement vs low-engagement. The segmented view reveals which populations your evals accurately represent.

Step 3: Add User Context to a Subset

Add expertise level, goals, and preferences to 20% of eval cases. Compare contextualized scores to generic scores. The gap reveals how much personalization quality your current evals miss.

Step 4: Track the Alignment Gap Over Time

Make the gap between eval scores and user satisfaction a first-class metric. When the gap widens, recalibrate evals. When it narrows, your quality measurement is improving.

| What Evals Measure | What Users Experience | The Gap |
| --- | --- | --- |
| Output correctness | Was it useful to me? | Intent alignment |
| Average quality | My personal quality | Distribution blindness |
| Performance on test set | Performance on my use cases | Temporal drift |
| Generic quality criteria | Quality for my context | Personalization blindness |

Trade-offs

Adding user context to evals increases complexity. Each eval case now requires user context alongside the input and expected output. This is more work to maintain, but it is the only way to measure what users actually experience.

Distribution analysis requires more data. You cannot segment quality by user type with 50 test cases. You need volume to see meaningful distributions. This is where LLM-as-judge or automated scoring becomes essential.

Continuous eval maintenance is ongoing work. Refreshing 20% of test cases monthly requires discipline. It is tempting to let the eval suite stagnate. Budget the time as non-negotiable quality infrastructure.

The alignment gap is uncomfortable. When you first measure it, the gap will be larger than you expect. This can feel like a failure. It is not; it is visibility. You cannot close a gap you cannot see.

What to Do Next

  1. Calculate your alignment gap this week. Pull 50 interactions with both eval scores and user satisfaction signals. Calculate the correlation. If it is below 0.7, your evals are lying to you about user experience. This single number will change how you think about quality.

  2. Segment your quality data. Break your quality scores by user type. Find the populations where quality is lowest. These are the users churning, and your average is hiding them.

  3. Add user context to your next eval cycle. For your next round of human review or LLM-as-judge evaluation, include user context, even basic attributes like expertise level and primary goal. Clarity provides the self-model API that makes user-aware evaluation possible. Watch how the scores change when the evaluator knows who the output was for.


Your evals say quality is 0.89. Your users say otherwise. Believe the users. Close the alignment gap.

References

  1. Hamel Husain argues in his eval framework
  2. only 1 in 26 unhappy customers actually complains
  3. not a reliable predictor of customer retention
  4. sampling bias, non-response bias, cultural bias, and questionnaire bias
  5. Qualtrics notes in their churn prediction framework
  6. NPS does not correlate with renewal or churn
