
The Alignment Score: A Better Metric Than Accuracy

Accuracy measures whether the AI got the right answer. Alignment measures whether it understood the question, and the person asking it. The metric that predicts retention better than any other.

Robert Ta · CEO & Co-Founder · 7 min read

TL;DR

  • Accuracy measures whether the AI got the right answer but ignores whether the answer served the specific user; you can have 94% accuracy and still churn users who feel misunderstood
  • The alignment score combines four dimensions (relevance, coherence, depth, and fit) to capture the full quality experience, including the personalization that accuracy misses
  • In practice, alignment score is a stronger predictor of retention than accuracy because it measures the user’s experience, not just the output’s correctness

The alignment score is a composite metric that measures how well AI output serves a specific user, not just whether the output is factually correct. Products with 94% accuracy still lose users because accuracy ignores relevance, coherence, depth, and fit: the dimensions that actually drive retention. This post covers how the alignment score works, why it predicts retention 2.3x better than accuracy alone, and how to implement it alongside your existing evaluation infrastructure.

94%
accuracy did not prevent churn
4
dimensions in the alignment score
2.3x
better retention predictor than accuracy

What Alignment Measures That Accuracy Misses

The alignment score is a composite metric built from four dimensions. Each dimension captures something that accuracy alone ignores.

Relevance measures intent match. Did the AI address what the user actually needed, not just what they literally asked? A user who asks "What is the return policy?" might actually need help deciding whether to return a specific item. Accuracy measures whether the AI correctly stated the return policy. Relevance measures whether it helped with the underlying decision.

Coherence measures consistency over time. Does the AI maintain stable reasoning across interactions? Does it build on previous conversations? An AI that recommends aggressive investing in one session and conservative investing in the next, both accurate in isolation, scores low on coherence. Users experience this as the AI not knowing them.

Depth measures insight value. Did the AI tell the user something they did not already know? A financially literate user asking about portfolio allocation does not need a definition of diversification; they need insight about their specific situation. Accuracy does not distinguish between restating common knowledge and providing genuine analysis.

Fit measures personalization quality. Did the response match this specific user’s expertise level, communication style, and current goals? This is the dimension that accuracy misses entirely. The same correct answer can be excellent for one user and terrible for another, depending on who they are.

Relevance: Intent Match

Did the AI address what the user actually needed, not just what they literally asked? Captures the gap between the question and the real need.

Coherence: Consistency Over Time

Does the AI maintain stable reasoning across interactions? Contradictory advice session to session destroys trust.

Depth: Insight Value

Did the AI go beyond restating common knowledge? Distinguishes genuine analysis from parroting known information.

Fit: Personalization Quality

Did the response match this user’s expertise, style, and goals? The same correct answer can be excellent or terrible depending on the user.

Accuracy-Only View

  • Response was factually correct: pass
  • Model answered the question asked: pass
  • No hallucinations detected: pass
  • Overall accuracy: 94%
  • Why are users leaving?

Alignment Score View

  • Relevance: 0.82 (addressed the real need, not just the literal question)
  • Coherence: 0.78 (mostly consistent, but one contradiction detected)
  • Depth: 0.65 (restated known information instead of providing insight)
  • Fit: 0.53 (too technical for this user's expertise level)
  • Alignment: 0.70 (depth and fit are driving churn)
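The decomposed view above can be represented as a plain record. This TypeScript sketch is illustrative only; the field names are assumptions, not Clarity's actual API:

```typescript
// Illustrative shape for a decomposed alignment result.
// Field names are assumptions, not Clarity's actual API.
interface AlignmentResult {
  relevance: number; // intent match, 0-1
  coherence: number; // consistency over time, 0-1
  depth: number;     // insight value, 0-1
  fit: number;       // personalization quality, 0-1
  overall: number;   // weighted composite of the four dimensions
  weakest: "relevance" | "coherence" | "depth" | "fit";
}

// The example from the breakdown above:
const example: AlignmentResult = {
  relevance: 0.82, coherence: 0.78, depth: 0.65, fit: 0.53,
  overall: 0.70, weakest: "fit",
};
```

Because each dimension is reported separately, the weakest dimension points directly at what to fix: here, fit.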

The Math Behind the Alignment Score

The alignment score is a weighted composite of the four dimensions. The weights reflect your product’s priorities.

For a financial advisory AI: Relevance 0.25, Coherence 0.25, Depth 0.30, Fit 0.20. Depth is weighted highest because users come to the product for insight; restating known information is the fastest path to churn.

For a customer support AI: Relevance 0.35, Coherence 0.20, Depth 0.15, Fit 0.30. Relevance is weighted highest because users need their specific issue addressed. Fit is weighted second because the response needs to match the user’s technical sophistication.

For a personalized learning AI: Relevance 0.20, Coherence 0.25, Depth 0.25, Fit 0.30. Fit is weighted highest because the entire value proposition is adaptation to the learner’s level.
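The composite itself is just a weighted sum of the four dimension scores. A minimal sketch, assuming the financial advisory weights above (function and variable names are illustrative):

```typescript
type Dimension = "relevance" | "coherence" | "depth" | "fit";
type Scores = Record<Dimension, number>;

// Weighted composite of the four dimensions; weights should sum to 1.0.
function alignmentScore(scores: Scores, weights: Scores): number {
  const dims: Dimension[] = ["relevance", "coherence", "depth", "fit"];
  return dims.reduce((sum, d) => sum + scores[d] * weights[d], 0);
}

// Financial advisory weighting: depth weighted highest.
const advisoryWeights: Scores = { relevance: 0.25, coherence: 0.25, depth: 0.30, fit: 0.20 };
const scores: Scores = { relevance: 0.82, coherence: 0.78, depth: 0.65, fit: 0.53 };
const score = alignmentScore(scores, advisoryWeights); // ≈ 0.70
```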

Financial Advisory AI

Depth weighted highest (0.30). Users come for insight. Restating known information is the fastest path to churn.

Customer Support AI

Relevance weighted highest (0.35). Users need their specific issue addressed, matched to their technical level.

Personalized Learning AI

Fit weighted highest (0.30). The entire value proposition is adaptation to the individual learner’s level.

The weights are product decisions, not mathematical constants. They encode what matters to your users. Getting them right requires the same kind of calibration that Hamel Husain describes [1] for eval systems: start with a hypothesis, measure against user outcomes, adjust.

alignment-score.ts
```typescript
// Calculate the alignment score: the metric that predicts retention
const alignment = await clarity.getAlignmentScore(userId, {
  interaction: sessionId,
  weights: {
    relevance: 0.25,
    coherence: 0.25,
    depth: 0.30,
    fit: 0.20,
  },
  userContext: await clarity.getSelfModel(userId),
});

// Returns a decomposable, trackable, actionable result:
// { overall: 0.70, relevance: 0.82, coherence: 0.78,
//   depth: 0.65, fit: 0.53,
//   weakest: 'fit', retentionRisk: 'high' }
```

Why Alignment Predicts Retention Better

The correlation between accuracy and retention is moderate, around 0.4-0.5 in the products I have studied. The correlation between alignment score and retention is strong, around 0.7-0.8.

Why the difference? Because accuracy is a property of the output. Retention is a property of the relationship. And relationships are built on more than correctness.

Consider two friends who give you advice. Friend A always has the right answer but delivers it in a way that makes you feel stupid. Friend B sometimes gets details wrong but always understands what you are going through and tailors their advice to your situation. Most people retain Friend B and stop calling Friend A.

AI products work the same way. Users do not retain products that are correct. They retain products that understand them. The alignment score measures understanding. Accuracy measures correctness. They correlate, but they are not the same thing.

The practical implication is clear: if you optimize only for accuracy, you will hit a quality ceiling that does not translate to retention. If you optimize for alignment, accuracy becomes one component of a metric that actually predicts whether users stay.

Implementing the Alignment Score

Step 1: Measure what you have. You probably already measure some form of accuracy. Add the three missing dimensions (relevance, coherence, and fit) using the same eval infrastructure. For relevance and coherence, you can start with LLM-as-judge approaches. For fit, you need user context.

Step 2: Calculate the composite. Weight the four dimensions based on your product priorities. Start with equal weights (0.25 each) and adjust based on which dimensions correlate most with your retention data.

Step 3: Track the alignment-retention correlation. Every month, calculate the correlation between alignment scores and user retention. This tells you whether your alignment measurement is calibrated correctly. If the correlation is below 0.6, your dimension weights or scoring criteria need adjustment.
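The monthly calibration check can be a plain Pearson correlation between per-user alignment scores and a 30-day retention indicator (1 = retained, 0 = churned). This helper is a generic sketch, not part of any SDK, and the sample data is invented for illustration:

```typescript
// Pearson correlation coefficient between two equal-length series.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mean = (a: number[]) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    cov += dx * dy;
    vx += dx * dx;
    vy += dy * dy;
  }
  return cov / Math.sqrt(vx * vy);
}

// Per-user alignment scores vs. 30-day retention (invented sample data).
const alignmentScores = [0.82, 0.45, 0.71, 0.30, 0.90, 0.55];
const retained30d     = [1,    0,    1,    0,    1,    1];
const r = pearson(alignmentScores, retained30d);
// If r drops below 0.6, revisit dimension weights or scoring criteria.
```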

Step 4: Report alignment alongside accuracy. Do not replace accuracy with alignment, report both. The accuracy score gives your engineering team a technical quality metric. The alignment score gives your product team a user experience metric. The gap between them is the improvement opportunity.

Step 1: Measure What You Have

Add relevance, coherence, and fit to your existing accuracy measurements. Use LLM-as-judge for relevance and coherence; fit requires user context.

Step 2: Calculate the Composite

Weight the four dimensions based on product priorities. Start with equal weights (0.25 each) and adjust based on retention correlation data.

Step 3: Track Alignment-Retention Correlation

Monthly correlation checks between alignment scores and retention. Below 0.6 means your weights or scoring criteria need recalibration.

Step 4: Report Both Metrics

Accuracy for engineering (technical quality). Alignment for product (user experience). The gap between them is your improvement opportunity.

| Metric | What It Measures | Retention Correlation | Actionability |
| --- | --- | --- | --- |
| Accuracy | Output correctness | 0.4-0.5 | Fix model and retrieval |
| User satisfaction (survey) | Self-reported experience | 0.5-0.6 | General direction |
| Alignment score | User-output fit | 0.7-0.8 | Specific dimensions to improve |
| Engagement metrics | Usage patterns | 0.3-0.4 | Misleading for quality |

Trade-offs

The alignment score requires more infrastructure than accuracy. Accuracy can be measured with test cases and reference answers. Alignment requires user context (self-models), multi-dimensional evaluation, and weight calibration. The investment is worth it for the predictive power, but it is not trivial.

Dimension weights are judgment calls. There is no objectively correct weighting. Two reasonable product leaders can disagree about whether depth should be 0.25 or 0.35. The weights should be informed by retention data, not guesswork, but the initial calibration requires product judgment.

The fit dimension is hard to bootstrap. New users have no self-model. Their fit score is essentially unmeasured until you accumulate enough interaction data. Plan for a cold-start period where fit is either excluded from the alignment score or estimated from demographic proxies.
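One way to handle the cold start, sketched here as one option rather than a prescribed approach, is to compute the composite over only the measured dimensions and renormalize the remaining weights (names are illustrative):

```typescript
type Dimension = "relevance" | "coherence" | "depth" | "fit";

// Composite over whichever dimensions are measured. Unmeasured ones
// (e.g. fit for a new user with no self-model) are dropped and the
// remaining weights renormalized so they still sum to 1.0.
function partialAlignmentScore(
  scores: Partial<Record<Dimension, number>>,
  weights: Record<Dimension, number>,
): number {
  const dims = (Object.keys(weights) as Dimension[]).filter((d) => scores[d] !== undefined);
  const totalWeight = dims.reduce((s, d) => s + weights[d], 0);
  return dims.reduce((s, d) => s + (scores[d] as number) * (weights[d] / totalWeight), 0);
}

const weights = { relevance: 0.25, coherence: 0.25, depth: 0.30, fit: 0.20 };
// New user: no fit score yet, so the composite uses the other three.
const coldStart = partialAlignmentScore(
  { relevance: 0.82, coherence: 0.78, depth: 0.65 },
  weights,
); // ≈ 0.74
```

Once enough interaction data accumulates to score fit, the same function falls back to the full four-dimension composite without a code change.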

Alignment is not a replacement for accuracy. A product with high alignment and low accuracy has a different problem: it understands users well but gives them wrong information. Both metrics are necessary; alignment adds the dimension that accuracy is missing.

What to Do Next

  1. Calculate accuracy-retention correlation. Pull your accuracy scores and 30-day retention rates. Calculate the correlation. If it is below 0.5, accuracy is not driving retention, and you need a metric that does.

  2. Add one dimension to your eval. Start with relevance; it is the easiest to measure and the most impactful after accuracy. Score 50 interactions on relevance using a simple rubric (1-5: did the AI address the user’s actual intent?). See how it correlates with satisfaction.

  3. Plan for the fit dimension. Fit is the highest-leverage dimension but requires user context. Even before you build self-model infrastructure, start tracking cases where the AI’s response did not match the user’s level or needs. This qualitative data will inform your alignment score design. Clarity provides the self-model API that makes fit scoring automated.


Accuracy tells you the AI got the right answer. Alignment tells you it understood the person. Measure what matters for retention.

References

  1. Hamel Husain describes
  2. not a reliable predictor of customer retention
  3. sampling bias, non-response bias, cultural bias, and questionnaire bias
  4. NPS does not correlate with renewal or churn
  5. Nielsen Norman Group has noted
  6. Research confirms


Robert Ta

We build in public. Get Robert's weekly newsletter on building better AI products with Clarity, with a focus on hyper-personalization and digital twin technology. Join 1500+ founders and builders at Self Aligned.

Subscribe to Self Aligned →