AI Product Teams Are Measuring the Wrong Things
Most AI product teams track accuracy, F1 scores, and latency. Users care about none of these. The metrics that predict retention measure understanding, not performance - and almost nobody is tracking them.
TL;DR
- AI product teams inherit metrics from two worlds - ML research metrics (accuracy, F1, perplexity) and SaaS metrics (DAU, NPS, MRR) - and neither set captures what actually drives AI product retention: user understanding
- The metrics that predict retention are alignment metrics: intent match rate, preference adaptation speed, belief consistency, and understanding depth - all of which measure whether the AI knows the user, not whether it performs well on benchmarks
- Teams that shift from performance metrics to understanding metrics detect retention problems 3 weeks earlier and make better product decisions because they optimize for the right outcome
AI product teams are measuring the wrong things when they track accuracy, F1 scores, and latency instead of understanding metrics like intent match rate and belief consistency. These model performance metrics have less than 0.3 correlation with user retention, while understanding metrics correlate above 0.7. This post covers the metrics gap between ML teams and product teams, the four understanding metrics every AI product should track, and how to assign ownership so the most predictive metrics stop falling through the cracks.
The Two Metrics Worlds
AI product teams live in two metrics worlds that rarely communicate.
The ML world measures model quality through research-derived metrics. Accuracy: how often is the output correct? F1 score: how well does the model balance precision and recall? Perplexity: how surprised is the model by the test data? BLEU and ROUGE: how closely does the output match reference texts? These metrics are well-defined, reproducible, and easy to compute.
The SaaS world measures business outcomes through growth-derived metrics. Daily active users: how many people use the product? Monthly recurring revenue: how much are they paying? Net promoter score: would they recommend it? Churn rate: how many are leaving?
Both measurement systems are internally coherent. The ML metrics tell you about model capability. The SaaS metrics tell you about business health. But neither tells you about the bridge between them: the user’s experience of being understood.
This gap exists because understanding is hard to measure and does not fit neatly into either world. It is not a model metric - you cannot compute “understanding” from a test set. It is not a traditional SaaS metric - it is too granular and too personal for monthly surveys.
So it goes unmeasured. And unmeasured means unmanaged. And unmanaged means your product optimizes for metrics that do not predict the outcome you care about.
What AI Teams Measure
- × Accuracy / F1 score (model correctness)
- × Latency / throughput (system performance)
- × DAU / MAU (engagement volume)
- × NPS / CSAT (periodic satisfaction)
What AI Teams Should Measure
- ✓ Intent match rate (does the AI understand what users want?)
- ✓ Preference adaptation speed (how fast does it learn?)
- ✓ Belief consistency (does it respect what users have told it?)
- ✓ Understanding depth (how well does it know each user over time?)
The Understanding Metrics Stack
Understanding metrics sit between model metrics and business metrics. They measure the quality of the relationship between the AI and each user - not in aggregate, but per-person.
Intent match rate measures how often the AI correctly identifies what the user is trying to accomplish. Not whether the output is accurate - whether it addresses the right problem. A user who asks for help writing an email and receives a grammar check has experienced an intent mismatch, even if the grammar check was technically perfect.
Preference adaptation speed measures how quickly the AI learns individual user preferences. After how many interactions does the AI produce output in the user’s preferred tone? After how many sessions does it stop suggesting things the user has already rejected? This metric directly measures learning - and learning is what users expect from AI.
Belief consistency measures whether the AI respects the user’s stated and observed beliefs. If a user has expressed a preference for conservative approaches, does the AI consistently honor that? Or does it occasionally suggest aggressive alternatives, creating a feeling of misunderstanding?
Understanding depth is a composite metric that captures how well the AI’s internal model of the user matches reality. It improves over time as the self-model accumulates observations and refines beliefs. A user with high understanding depth experiences an AI that “gets” them - and that feeling is the strongest predictor of retention.
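To make "how fast does it learn" concrete, here is one naive way to measure preference adaptation speed: count interactions until a stated preference is consistently honored. This is a sketch only - the event shape and the three-in-a-row threshold are illustrative assumptions, not a standard definition.

```typescript
// Illustrative sketch: preference adaptation speed as "interactions until honored".
// The event shape and the 3-in-a-row threshold are assumptions, not a standard.
interface PreferenceEvent {
  interactionIndex: number; // 1, 2, 3, ... since the preference was first expressed
  honored: boolean;         // did this output respect the preference?
}

// Returns the interaction count at which the AI first honored the preference
// `streak` times in a row, or null if it never stably adapted.
function adaptationSpeed(events: PreferenceEvent[], streak = 3): number | null {
  let run = 0;
  for (const e of events) {
    run = e.honored ? run + 1 : 0;
    if (run >= streak) return e.interactionIndex;
  }
  return null;
}
```

Averaging this number across users gives a single "how many interactions until we adapt" figure you can track week over week.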
Layer 1: Intent Match Rate
How often the AI correctly identifies what the user is trying to accomplish. A grammar check for someone asking for email help is an intent mismatch, even if technically perfect.
Layer 2: Preference Adaptation Speed
How quickly the AI learns individual preferences. After how many interactions does it produce output in the user’s preferred tone? This metric directly measures learning.
Layer 3: Belief Consistency
Whether the AI respects stated and observed beliefs. If a user prefers conservative approaches, does the AI consistently honor that preference?
Layer 4: Understanding Depth
Composite metric capturing how well the AI’s internal model of the user matches reality. The strongest predictor of retention. Improves as the self-model accumulates observations.
```typescript
// The metrics stack your AI product is missing - understanding metrics
const metrics = await clarity.getUnderstandingMetrics(userId);

// Level 1: Intent Match - did we understand the request?
// metrics.intentMatch: 0.83 (83% of requests correctly interpreted)

// Level 2: Preference Adaptation - how fast are we learning?
// metrics.preferenceAdaptation: 0.67 (adapted to 67% of observed preferences)

// Level 3: Belief Consistency - do we respect stated values?
// metrics.beliefConsistency: 0.79 (consistent with user beliefs 79% of the time)

// Level 4: Understanding Depth (composite) - how well do we know this user?
// metrics.understandingDepth: 0.74 (after 23 sessions, 156 observations)

// Retention prediction based on understanding metrics
// metrics.retentionRisk: 'low' (understanding depth > 0.7 = 89% 90-day retention)
```
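For intuition, here is one way a composite understanding depth score could be derived from the three layer scores. This is a sketch, not Clarity's actual formula - the weights and the session-count damping are illustrative assumptions.

```typescript
// Illustrative only: blend the three layer scores into a composite depth score.
// The weights and the confidence ramp are assumptions, not Clarity's real formula.
interface UnderstandingSignals {
  intentMatch: number;          // 0..1, share of requests correctly interpreted
  preferenceAdaptation: number; // 0..1, share of observed preferences adapted to
  beliefConsistency: number;    // 0..1, share of outputs consistent with stated beliefs
  sessions: number;             // how much evidence we have for this user
}

function understandingDepth(s: UnderstandingSignals): number {
  // Weighted blend of the three layers (weights are a guess; tune per product).
  const blended =
    0.4 * s.intentMatch +
    0.3 * s.preferenceAdaptation +
    0.3 * s.beliefConsistency;

  // Damp the score for users we have barely observed, so depth has to be earned.
  const confidence = Math.min(1, s.sessions / 20);
  return blended * confidence;
}

// Example, using roughly the numbers from the snippet above.
const depth = understandingDepth({
  intentMatch: 0.83,
  preferenceAdaptation: 0.67,
  beliefConsistency: 0.79,
  sessions: 23,
});
// depth ≈ 0.77 - in the same ballpark as metrics.understandingDepth
```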
| Metric Layer | What It Measures | Correlation with Retention | Who Owns It Today |
|---|---|---|---|
| Model metrics (F1, accuracy) | Model capability | 0.23 | ML team |
| System metrics (latency, throughput) | Infrastructure health | 0.15 | Engineering team |
| Engagement metrics (DAU, sessions) | Usage volume | 0.44 | Product team |
| Understanding metrics (alignment, adaptation) | User relationship quality | 0.71 | Nobody |
The Ownership Problem
The metrics gap is partly an organizational problem. Model metrics are owned by the ML team. System metrics are owned by engineering. Engagement metrics are owned by product. Understanding metrics are owned by nobody.
This ownership vacuum exists because understanding metrics span all three domains. Computing intent match rate requires ML infrastructure. Measuring preference adaptation requires product instrumentation. Tracking belief consistency requires user modeling infrastructure that does not exist in most organizations.
The result is that the most important metrics for AI product success fall into an organizational gap. Everyone assumes someone else is tracking them. Nobody is.
The fix is explicit ownership. Someone - typically a product manager with technical depth or a technical program manager - needs to own the understanding metrics stack. Not build it alone, but own the outcome: these metrics exist, they are accurate, and they are reviewed weekly.
ML Team Owns
Model metrics: accuracy, F1, perplexity, BLEU. These measure model capability against benchmarks. Correlation with retention: 0.23.
Product Team Owns
Engagement metrics: DAU, sessions, NPS, MRR. These measure business health. Correlation with retention: 0.44.
Nobody Owns (Yet)
Understanding metrics: intent match, preference adaptation, belief consistency, depth. Correlation with retention: 0.71. The most predictive metrics fall into an organizational gap.
Trade-offs
Understanding metrics are expensive to compute. Unlike model accuracy, which can be computed from a test set, understanding metrics require per-user models and continuous evaluation. The investment is real - but so is the cost of optimizing for metrics that do not predict retention.
Defining “understanding” is inherently subjective. Two reasonable people might disagree about whether an AI “understood” a user in a specific interaction. The mitigation is using multiple proxy signals and looking at trends rather than individual measurements.
Adding new metrics creates dashboard fatigue. Teams already drowning in metrics will resist adding more. The solution is not adding understanding metrics alongside existing metrics - it is replacing low-value metrics with high-value ones. F1 score improvements that do not correlate with retention are not worth tracking weekly.
What to Do Next
Week 1: Compute the Correlation
Take your last 6 months of data and compute the correlation between model metrics and retention. If below 0.4, you have proof you are measuring the wrong things.
Week 2: Instrument Intent Match
Start with intent match rate. After each AI interaction, capture whether the user engaged with or rejected the output. Track the ratio weekly.
Week 3: Assign Ownership
Choose one person to own the understanding metrics stack. Give them the mandate to define, instrument, and report. Review in every product meeting.
- Compute the correlation. Take your last 6 months of data and compute the correlation between your model metrics (accuracy, F1) and your retention metrics (90-day retention, churn rate). If the correlation is below 0.4 - and I predict it will be - you have proof that you are measuring the wrong things. A rough sketch of the computation follows this list.
- Instrument one understanding metric. Start with intent match rate - it is the easiest to measure. After each AI interaction, capture whether the user engaged with the output (used it, built on it, asked follow-up questions) or rejected it (discarded it, rephrased, expressed frustration). The ratio is a rough intent match rate. Track it weekly. A minimal instrumentation sketch also follows below.
- Assign ownership. Choose one person to own the understanding metrics stack. Give them the mandate to define, instrument, and report on metrics that measure user understanding - not model performance. Review understanding metrics in every product meeting alongside engagement and revenue metrics.
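For the first step, a plain Pearson correlation over weekly aggregates is enough to make the point. The data loaders in the comments are hypothetical stand-ins for your own warehouse queries; only the correlation math is meant literally.

```typescript
// Pearson correlation between a weekly model metric (e.g. F1) and weekly 90-day retention.
// getWeeklyF1() and getWeeklyRetention() are hypothetical loaders for your own data.
function pearson(xs: number[], ys: number[]): number {
  const n = Math.min(xs.length, ys.length);
  const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / n;
  const mx = mean(xs.slice(0, n));
  const my = mean(ys.slice(0, n));
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    cov += dx * dy;
    vx += dx * dx;
    vy += dy * dy;
  }
  return vx === 0 || vy === 0 ? 0 : cov / Math.sqrt(vx * vy);
}

// const weeklyF1 = await getWeeklyF1();               // e.g. [0.81, 0.82, 0.84, ...]
// const weeklyRetention = await getWeeklyRetention(); // e.g. [0.41, 0.39, 0.44, ...]
// console.log(pearson(weeklyF1, weeklyRetention));    // below 0.4? You have your answer.
```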
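For the second step, the instrumentation can start as a single event plus a weekly ratio. The event name and the `analytics` object are placeholders for whatever pipeline you already run.

```typescript
// Minimal intent-match instrumentation sketch. `analytics.track` is a placeholder
// for your existing analytics pipeline, not a specific library API.
declare const analytics: {
  track: (event: string, props: Record<string, unknown>) => void;
};

type InteractionOutcome = 'engaged' | 'rejected';

function recordInteraction(userId: string, outcome: InteractionOutcome): void {
  // 'engaged': used the output, built on it, asked a follow-up.
  // 'rejected': discarded it, rephrased the request, expressed frustration.
  analytics.track('ai_interaction_outcome', { userId, outcome, at: Date.now() });
}

// Weekly rollup: rough intent match rate = engaged / (engaged + rejected).
function intentMatchRate(outcomes: InteractionOutcome[]): number {
  if (outcomes.length === 0) return 0;
  const engaged = outcomes.filter((o) => o === 'engaged').length;
  return engaged / outcomes.length;
}
```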