
The AI Product Health Check

Most AI products do not know whether they are healthy. Here is a 15-minute diagnostic that tells you whether your AI product is improving, plateauing, or silently degrading, before your users tell you.

Robert Ta · CEO & Co-Founder · 6 min read

TL;DR

  • Most AI products are silently degrading while vanity metrics (NPS, MAU) look stable; the degradation is masked by an influx of new users replacing churning existing ones
  • A 15-minute health check using five diagnostic questions reveals whether your AI product is improving, plateauing, or dying
  • Running this health check monthly catches quality regressions before users notice and before the damage compounds

An AI product health check is a 15-minute diagnostic using five vital signs that reveals whether the product is improving, plateauing, or silently degrading before users notice. Most AI products show stable vanity metrics (NPS, MAU) while quality degrades underneath, because new user influx masks the churn of dissatisfied existing users. This post covers the five vital signs to check monthly, how to run the diagnostic step by step, and how to automate continuous health monitoring.


The Five Vital Signs

Your AI product has five vital signs. Check them monthly. If any one is trending in the wrong direction, you have a problem that needs attention before it becomes a crisis.

Vital Sign 1: Cohort Quality Trend. Is the AI getting better for users who have been on the platform for 30, 60, and 90 days? This is the single most important diagnostic. If quality improves with usage, your product is healthy. If quality stays flat or declines, your product is not learning from its users.
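To make this concrete, here is a minimal TypeScript sketch of the cohort comparison, assuming a hypothetical interaction log where each record carries the user's tenure in days and a qualityScore you already compute:

// Sketch: average response quality by tenure cohort.
interface Interaction {
  userTenureDays: number;
  qualityScore: number; // 0..1, from your own quality rubric
}

function cohortQuality(interactions: Interaction[]): Record<string, number> {
  // Users under 30 days of tenure are ignored for this diagnostic.
  const cohorts: Record<string, number[]> = { '30d': [], '60d': [], '90d': [] };
  for (const i of interactions) {
    if (i.userTenureDays >= 90) cohorts['90d'].push(i.qualityScore);
    else if (i.userTenureDays >= 60) cohorts['60d'].push(i.qualityScore);
    else if (i.userTenureDays >= 30) cohorts['30d'].push(i.qualityScore);
  }
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  return Object.fromEntries(Object.entries(cohorts).map(([k, v]) => [k, avg(v)]));
}
// Healthy: the 90d average exceeds the 60d average, which exceeds the 30d average.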

Vital Sign 2: Alignment Score Distribution. What percentage of users have alignment scores above 0.7? Above 0.5? Below 0.3? The distribution matters more than the average. A bimodal distribution (many users above 0.7 and many below 0.3) indicates the product works well for some user types and poorly for others.
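A minimal sketch of the bucketing, with a crude bimodality heuristic; the tail and middle cutoffs below are assumptions, not a statistical test:

// Sketch: bucket per-user alignment scores and flag a bimodal shape.
function alignmentBuckets(alignmentScores: number[]) {
  const low = alignmentScores.filter(s => s < 0.3).length;
  const high = alignmentScores.filter(s => s > 0.7).length;
  const mid = alignmentScores.length - low - high;
  const n = Math.max(alignmentScores.length, 1);
  // Crude heuristic: both tails heavy and the middle thin.
  const bimodal = low / n > 0.25 && high / n > 0.25 && mid / n < 0.3;
  return { low, mid, high, bimodal };
}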

Vital Sign 3: Failure Mode Coverage. What percentage of messy inputs are handled gracefully? Sample 20 interactions where the AI produced a poor response and categorize each failure. If 80 percent of failures fall into 3 categories, those categories are your highest-priority fixes.
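Once the 20 samples are labeled by hand, the tally itself is trivial; this sketch assumes a hypothetical array of category labels, one per reviewed interaction:

// Sketch: count failure categories and surface the top 3.
function topFailureCategories(labels: string[], topN = 3): [string, number][] {
  const counts = new Map<string, number>();
  for (const label of labels) counts.set(label, (counts.get(label) ?? 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, topN);
}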

Vital Sign 4: Confidence Calibration. When your AI is 90 percent confident, is it correct 90 percent of the time? If it is confident but wrong, users learn not to trust it. If it is uncertain but right, users miss good answers. Both are health problems.
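A minimal sketch of the calibration check, assuming a hypothetical log of interactions where the AI's stated confidence and the eventual correctness were both recorded:

// Sketch: gap between stated confidence and observed accuracy.
interface Outcome {
  confidence: number;  // what the AI claimed, 0..1
  wasCorrect: boolean; // what actually happened
}

function calibrationGap(outcomes: Outcome[], threshold = 0.9): number {
  const confident = outcomes.filter(o => o.confidence >= threshold);
  if (confident.length === 0) return 0;
  const accuracy = confident.filter(o => o.wasCorrect).length / confident.length;
  // Positive gap: overconfident. Negative gap: underconfident.
  return threshold - accuracy;
}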

Vital Sign 5: Recovery Rate. After a bad interaction, what percentage of users try again within 24 hours? A high recovery rate means users trust the product enough to give it another chance. A low recovery rate means a single failure drives users away.
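A sketch of the recovery computation, assuming hypothetical event logs of flagged failures and of all interactions; a linear scan is fine at health-check scale:

// Sketch: share of users who return within the window after a failure.
interface Event { userId: string; at: Date; }

function recoveryRate(failures: Event[], interactions: Event[], windowHours = 24): number {
  const windowMs = windowHours * 3600 * 1000;
  const recovered = failures.filter(f =>
    interactions.some(i =>
      i.userId === f.userId &&
      i.at.getTime() > f.at.getTime() &&
      i.at.getTime() - f.at.getTime() <= windowMs
    )
  ).length;
  return failures.length === 0 ? 1 : recovered / failures.length;
}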


Vanity Metrics Dashboard

  • NPS (does not distinguish new from existing users)
  • MAU (masks churn with new user acquisition)
  • Revenue (lagging indicator by 30 to 90 days)
  • Average satisfaction (averages hide distributions)

Diagnostic Health Check

  • Cohort quality trend (is the product learning per user?)
  • Alignment score distribution (who is underserved?)
  • Failure mode coverage (what breaks most often?)
  • Confidence calibration (does the product know what it knows?)
  • Recovery rate (do users come back after a failure?)

Running the Health Check

Here is the step-by-step process. It takes 15 minutes once you have the data pipelines in place.

Step 1: Pull cohort quality data (3 minutes). Compare the average response quality for users at 30, 60, and 90 days. If 90-day users get better responses than 30-day users, the product is learning. If not, it is stagnant.

Step 2: Plot the alignment distribution (3 minutes). Histogram of alignment scores across your user base. Look for bimodality. Identify the user profiles with the lowest scores. These are your highest-churn-risk segments.

Step 3: Sample failure modes (3 minutes). Pull 20 recent poor interactions. Categorize each failure. Count categories. The top 3 categories are your prioritized fix list.

Step 4: Check confidence calibration (3 minutes). For interactions where the AI expressed high confidence, what percentage were actually correct? For low-confidence interactions, what percentage were actually wrong? Plot expected confidence versus actual correctness.

Step 5: Measure recovery rate (3 minutes). After a flagged poor interaction, what percentage of users returned for another interaction within 24 hours? This is your resilience metric.


health-check.ts
// Automated health check: run monthly, takes about 15 minutes
const healthReport = await clarity.runHealthCheck({
  cohortQuality: {              // vital sign 1
    cohorts: ['30d', '60d', '90d'],
    metric: 'response_quality_trend'
  },
  alignmentDistribution: {      // vital sign 2
    buckets: [0.3, 0.5, 0.7, 0.9]
  },
  failureModes: {               // vital sign 3
    sampleSize: 20,
    categorize: true
  },
  confidenceCalibration: {      // vital sign 4
    threshold: 0.9
  },
  recoveryRate: {               // vital sign 5
    windowHours: 24
  }
});
// Returns: { overall: 'healthy' | 'degrading' | 'critical', details: [...] }

Interpreting the Results

All green. Cohort quality trending up, alignment distribution shifting right, failure modes decreasing, confidence well-calibrated, high recovery rate. Your product is healthy. Keep doing what you are doing. Run the check again next month.

Mixed signals. Some vital signs healthy, others concerning. This is the most common result and the most actionable. Prioritize the worst vital sign. Fix it. Re-check in two weeks.

All red. Cohort quality flat or declining, alignment distribution bimodal or left-skewed, failure modes increasing, confidence miscalibrated, low recovery rate. Your product is in trouble. This is not a feature problem: it is an architectural problem. You need to invest in the understanding layer before shipping more features.

The most dangerous result is not all red; it is mixed signals with one hidden red. A product can be healthy on four vital signs and dying on one. If that one is the cohort quality trend, the death is slow enough to miss and fast enough to kill.


Vital Sign             | Healthy Signal                | Warning Signal             | Critical Signal
Cohort quality         | Improving at 30, 60, 90 days  | Flat across cohorts        | Declining with tenure
Alignment distribution | Right-skewed (most above 0.7) | Normal (centered at 0.5)   | Left-skewed or bimodal
Failure modes          | Top 3 categories shrinking    | Stable failure categories  | Growing failure categories
Confidence calibration | 90% confident = 90% correct   | 10-point gap               | 20+ point gap
Recovery rate          | Above 80%                     | 60 to 80%                  | Below 60%
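The table translates directly into code. As one example, a sketch of the recovery-rate row; the other vital signs would get analogous classifiers with their own thresholds:

// Sketch: classify the recovery-rate vital sign per the table above.
type Health = 'healthy' | 'warning' | 'critical';

function classifyRecoveryRate(rate: number): Health {
  if (rate > 0.8) return 'healthy';   // above 80 percent
  if (rate >= 0.6) return 'warning';  // 60 to 80 percent
  return 'critical';                  // below 60 percent
}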

Automating the Check

Do not rely on humans to run health checks. Humans forget. Automate.

Build a daily automated check that monitors all five vital signs and alerts when any crosses a threshold. Set the thresholds conservatively at first; you want to catch issues early, even at the cost of some false alarms. Tune the thresholds over time as you learn what your product’s normal ranges are.
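A minimal sketch of that loop; runHealthCheck and sendAlert here are hypothetical hooks into your own pipeline and paging system, not a real API:

// Sketch: daily check that alerts when any vital sign breaches its threshold.
interface VitalReading { name: string; value: number; threshold: number; }

async function dailyHealthCheck(
  runHealthCheck: () => Promise<VitalReading[]>,
  sendAlert: (msg: string) => Promise<void>
): Promise<void> {
  const readings = await runHealthCheck();
  for (const r of readings) {
    // Conservative by design: alert on any breach, tune thresholds over time.
    if (r.value < r.threshold) {
      await sendAlert(`Vital sign "${r.name}" at ${r.value}, below ${r.threshold}`);
    }
  }
}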

The automated check should produce a daily report that lives on a dashboard everyone can see. Make it impossible to miss a degrading vital sign by making it visible.

At Clarity, our automated health monitor runs every 6 hours and has caught 3 quality regressions before any user noticed. Each regression would have taken weeks to identify through user complaints. The health check caught them in hours.

Trade-offs

Implementing a health check system has costs:

Data pipeline investment. The five vital signs require data infrastructure: cohort tracking, alignment scoring, failure categorization, confidence calibration, and recovery tracking. If you do not already have these pipelines, building them takes 2 to 4 weeks.

Alert fatigue. Automated health checks generate alerts. Too many alerts and the team stops paying attention. Start with conservative thresholds and tune aggressively to reduce false positives.

Measurement overhead. Some vital signs, such as alignment scoring and confidence calibration, require computation on every interaction. This adds latency and cost. Profile the impact and optimize the critical path.

Action requirements. A health check is only useful if you act on the results. If a vital sign goes red and nobody has time to investigate, the check creates anxiety without improvement. Budget time for follow-up before implementing the check.

Metric interpretation. The five vital signs require context to interpret. A declining cohort quality trend might mean the model regressed, or it might mean you acquired a new user segment that the model has not seen before. Build interpretation guides alongside the metrics.

What to Do Next

1. Run the health check manually this week. Pull the data for all five vital signs. Spend 15 minutes reviewing the results. Even without automation, the manual check will reveal things your dashboard hides.

2. Automate vital sign 1 first. Cohort quality trend is the most important vital sign and the easiest to automate. Track average response quality for 30, 60, and 90-day user cohorts. Alert when the trend flattens or declines (see the sketch after this list).

3. Schedule the monthly check. Put a recurring 30-minute meeting on the calendar for the first Monday of every month. In that meeting, review all five vital signs, identify the most concerning trend, and assign one person to investigate. Do this for 3 months. By month 4, it will be the most valuable meeting on your calendar.
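For step 2 of this list, a minimal trend-alert sketch, assuming a hypothetical monthly series of cohort quality averages (oldest first); the minimum slope is an arbitrary starting point to tune:

// Sketch: flag a flattening or declining cohort quality trend.
function cohortTrendAlert(monthlyQuality: number[], minSlope = 0.005): boolean {
  if (monthlyQuality.length < 2) return false;
  const deltas = monthlyQuality.slice(1).map((v, i) => v - monthlyQuality[i]);
  const avgDelta = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  return avgDelta < minSlope; // flat or declining: alert
}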


Stop flying blind. Start checking your AI product’s vital signs. Get automated health monitoring with Clarity.

