
Measuring What Matters: Beyond Accuracy and Engagement

Accuracy and engagement are the default metrics for AI products. But accuracy does not measure user value, and engagement does not measure satisfaction. Here are the metrics that actually predict AI product success.

Robert Ta · CEO & Co-Founder · 7 min read

TL;DR

  • Accuracy (0.23 correlation with retention) and engagement (0.31 correlation) are weak predictors of AI product success - alignment score (0.78 correlation) is the metric that actually matters
  • Engagement optimization without value alignment backfires: one product increased DAU 22% while satisfaction dropped 15% and retention declined 8%
  • Alignment score - the degree to which each output matches what the specific user needs - is the single best predictor of long-term AI product success, and almost nobody measures it

Measuring what matters in AI products means tracking alignment score, not accuracy or engagement, as the primary predictor of retention. Accuracy (0.23 correlation with retention) and engagement (0.31 correlation) are weak predictors that create false confidence while users churn silently. This post covers the full metric framework, why engagement optimization backfires, and how to implement alignment-first metrics that actually predict whether your AI product will succeed.


Why Accuracy Is a Weak Predictor

Accuracy measures whether the output is correct. It does not measure whether the output is useful.

Consider an AI assistant that generates code. If I ask for a function to sort an array and it gives me a textbook-perfect quicksort implementation, the accuracy is 100%. But if I am a junior developer who needs to understand the code, and the implementation uses advanced language features I have never seen, the output is accurate and useless. If I am a senior developer looking for a production-ready solution, and the implementation lacks error handling and edge case coverage, it is accurate and incomplete.

Accuracy is context-independent. Value is context-dependent. The gap between them is the gap between what your model produces and what your user needs.

This does not mean accuracy is irrelevant - obviously, wrong outputs are bad. It means accuracy is a necessary floor, not a sufficient predictor. Above the floor, improvements in accuracy have diminishing returns for user satisfaction. The marginal value of going from 92% to 95% accuracy is far less than the marginal value of going from 60% to 80% alignment.

Standard AI Metrics

  • Accuracy on held-out test sets
  • DAU/MAU engagement ratios
  • Quarterly NPS surveys
  • Feature adoption percentages

Alignment-First Metrics

  • Per-user alignment score on every output
  • Value delivery rate (outputs user found useful)
  • Trust trajectory (does trust increase over time?)
  • Switching cost growth (is understanding compounding?)

The Engagement Trap

I studied an AI product that decided to optimize for engagement. Their product team set DAU/MAU as their north star metric and designed experiments to increase it.

They succeeded. DAU increased 22% over three months through notification frequency increases, feature prompts, and in-app nudges to return. The dashboard looked great. The engagement team celebrated.

In the same period, user-reported satisfaction dropped 15%. Ninety-day retention declined 8%. The product was driving usage without driving value. Users were coming back more often, staying for shorter sessions, and accomplishing less per session. The engagement optimization had turned the product into an annoyance that users tolerated rather than a tool they valued.

This is the engagement trap: engagement can increase even as value decreases. The metrics diverge because engagement measures frequency of use and value measures quality of use. Optimizing for frequency without optimizing for quality produces the worst possible outcome: users who interact more but trust less.

What Alignment Score Measures

Alignment score is not a single number. It is a multi-dimensional evaluation of whether each output served the specific user who received it.

Depth alignment: Did the explanation depth match the user’s expertise? Scored by comparing response complexity against the user’s demonstrated comprehension level.

Relevance alignment: Did the output address what the user actually needed? Scored by measuring task completion rate and follow-up query rate (fewer follow-ups = higher relevance).

Tone alignment: Did the communication style match the user’s context? A user in crisis mode needs different tone than a user in exploration mode.

Temporal alignment: Did the output account for what the user already knows? Measured by detecting repetition of previously covered content.
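As a concrete illustration, temporal alignment could be approximated by penalizing token overlap with content the user has already received. This is a minimal sketch, not Clarity's actual method; the tokenizer and scoring rule are deliberate simplifications:

```typescript
// Sketch: temporal alignment as novelty relative to prior outputs.
// Scores 1.0 when an output shares no content with what the user has
// already seen, approaching 0.0 as it repeats covered material.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

function temporalAlignment(output: string, priorOutputs: string[]): number {
  const current = tokenize(output);
  if (current.size === 0 || priorOutputs.length === 0) return 1.0;
  const seen = new Set<string>();
  for (const prior of priorOutputs) {
    for (const token of tokenize(prior)) seen.add(token);
  }
  let overlap = 0;
  for (const token of current) {
    if (seen.has(token)) overlap++;
  }
  // High repetition of previously covered content => low temporal alignment
  return 1 - overlap / current.size;
}
```

A production version would compare semantic similarity rather than raw token overlap, but the shape of the metric is the same: novelty relative to the user's history.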

The composite alignment score weights these dimensions based on their importance for the specific interaction type and user context. The weights are not static - they are informed by the user’s self-model.

alignment-metrics.ts
// Measure alignment, not just accuracy: the metric that predicts success
const alignment = await clarity.measureAlignment({
  output,
  userId,
  dimensions: {
    depth:     { weight: 0.30 }, // Matches expertise?
    relevance: { weight: 0.35 }, // Addresses actual need?
    tone:      { weight: 0.15 }, // Right communication style?
    temporal:  { weight: 0.20 }  // Builds on prior context?
  }
});

// Track over time: does alignment improve?
await clarity.trackMetric(userId, {
  metric: 'alignment_score',
  value: alignment.composite, // e.g. 0.83
  dimensions: alignment.breakdown
});
// Alignment trending up = product getting better for this user
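For readers who want to see the arithmetic behind the composite, here is a dependency-free sketch of a normalized weighted average over per-dimension scores. The `clarity.measureAlignment` call above is Clarity's API; the helper below is a hypothetical stand-in, not its implementation:

```typescript
// Sketch: composite alignment as a weighted average of per-dimension
// scores, with weights normalized so they need not sum exactly to 1.
interface DimensionScore {
  score: number;  // 0..1, how well this dimension aligned
  weight: number; // relative importance for this interaction type
}

function compositeAlignment(dimensions: Record<string, DimensionScore>): number {
  const entries = Object.values(dimensions);
  const totalWeight = entries.reduce((sum, d) => sum + d.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = entries.reduce((sum, d) => sum + d.score * d.weight, 0);
  return weighted / totalWeight;
}

// Example using the weights from the listing above (scores are illustrative):
const score = compositeAlignment({
  depth:     { score: 0.80, weight: 0.30 },
  relevance: { score: 0.90, weight: 0.35 },
  tone:      { score: 0.70, weight: 0.15 },
  temporal:  { score: 0.80, weight: 0.20 },
});
// 0.8*0.30 + 0.9*0.35 + 0.7*0.15 + 0.8*0.20 = 0.82
```

Because the weights come from the user's self-model, two users can receive the same output and get different composite scores, which is the entire point of the metric.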

The Full Metric Framework

Based on the correlation study, here is the metric framework I recommend for AI products. It replaces the standard accuracy-engagement-NPS trinity with metrics that actually predict success.

| Metric | What It Measures | Correlation with Retention | Measurement Method |
|---|---|---|---|
| Alignment score | Output-to-need match per user | 0.78 | Self-model-based evaluation |
| Value delivery rate | Percentage of outputs user found useful | 0.71 | Implicit (task completion) + explicit (feedback) |
| Trust trajectory | Is user trust increasing over time? | 0.68 | Trust score change over rolling 30-day window |
| Switching cost growth | Is accumulated understanding increasing? | 0.65 | Self-model depth and breadth over time |
| Time-to-value | How quickly does the product deliver value? | 0.54 | Time from first interaction to first valuable output |
| Accuracy (baseline) | Is the output factually correct? | 0.23 | Standard evaluation benchmarks |
| Engagement (DAU/MAU) | How often do users return? | 0.31 | Standard product analytics |

The top four metrics all require user understanding - you cannot measure alignment, value delivery, trust trajectory, or switching cost growth without knowing who the user is and what they need. This is why self-models are the prerequisite for meaningful AI product metrics, not just a personalization feature.
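Running the same kind of correlation study against your own data requires only per-user metric values and a retention outcome (e.g. 1 = retained at 90 days, 0 = churned). A sketch using the standard Pearson formula:

```typescript
// Pearson correlation between a per-user metric and a retention outcome.
// Inputs are parallel arrays: xs[i] is the metric value for user i,
// ys[i] is that user's retention outcome.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}
```

Feed it each candidate metric in turn and compare the coefficients; a metric that cannot clear roughly 0.5 against retention is not pulling its weight as a north star.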

Implementing Alignment Metrics

Measuring alignment requires three things that most AI product teams do not currently have.

1. A user model. You cannot measure whether an output aligns with a user’s needs if you do not know what those needs are. Self-models provide the baseline against which alignment is measured.

2. Per-output evaluation. Unlike accuracy (which can be measured on a test set) and engagement (which can be measured in aggregate), alignment must be evaluated on every output for every user. This requires automated evaluation infrastructure, not periodic audits.

3. Feedback integration. Alignment measurement improves when it incorporates both implicit feedback (did the user complete their task?) and explicit feedback (did the user rate the output?). The feedback loop must be lightweight enough that users actually use it.
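Combining the two feedback channels into a value delivery rate can be as simple as letting an explicit rating override the implicit signal when one exists. A hypothetical sketch; the field names are illustrative, not a real schema:

```typescript
// Sketch: value delivery rate per user. An output counts as "valuable"
// if the user explicitly rated it useful, or, absent a rating, if the
// task it supported was completed.
interface OutputSignal {
  explicitUseful?: boolean; // thumbs-up/down, when the user bothered to rate
  taskCompleted: boolean;   // implicit: did the user finish what they started?
}

function valueDeliveryRate(signals: OutputSignal[]): number {
  if (signals.length === 0) return 0;
  const valuable = signals.filter(s =>
    s.explicitUseful !== undefined ? s.explicitUseful : s.taskCompleted
  ).length;
  return valuable / signals.length;
}
```

Letting explicit feedback win keeps the metric honest: a completed task that the user still rated as unhelpful should not count as delivered value.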


Trade-offs

Alignment-first metrics are harder to implement and interpret than standard metrics.

Measurement complexity increases dramatically. Accuracy can be measured against a test set. Engagement can be measured with analytics. Alignment requires a self-model, per-output evaluation, and feedback integration. This is significantly more infrastructure.

Alignment is harder to A/B test. You can A/B test click-through rates. You cannot easily A/B test alignment because the metric is user-specific - the same output might be highly aligned for one user and poorly aligned for another. Experimentation requires controlling for user context.

Stakeholder communication is harder. “Our accuracy is 94%” is easy to communicate. “Our alignment score is 0.83 with depth as the weakest dimension” requires more explanation. Board members and investors are less familiar with alignment metrics than with accuracy and engagement.

There is no industry benchmark. Everyone knows what “good” DAU/MAU looks like. Nobody knows what a “good” alignment score is because the metric is novel. You will be establishing your own baselines without external reference points.

What to Do Next

  1. Correlate your existing metrics with retention. Take your current metrics (accuracy, engagement, NPS, whatever you track) and correlate each one with 90-day retention. If the correlations are weak (below 0.5), your metrics are not measuring what matters. This data will build the internal case for alignment-first metrics.

  2. Start measuring alignment on a sample. Pick 100 users and manually evaluate 10 outputs per user on the four alignment dimensions (depth, relevance, tone, temporal). Clarity automates alignment scoring through self-models. Compare the alignment scores with those users’ retention outcomes. The correlation will likely surprise you.

  3. Replace one vanity metric with one alignment metric. In your next team review, replace one engagement metric with alignment score or value delivery rate. Track both the old metric and the new one. Within 60 days, you will see which one better predicts the outcomes you care about.


Accuracy is a floor. Engagement is noise. Alignment is the signal. Start measuring what matters.

