
Measuring AI Product Quality When You Have No Baseline

You cannot improve what you cannot measure. But how do you measure AI quality when you have never measured it before? Here is how to create your first quality baseline from zero.

Robert Ta · CEO & Co-Founder · 7 min read

TL;DR

  • Most AI teams have never created a quality baseline because it feels like a massive infrastructure project - in reality, your first baseline takes 2 hours and 20 random interactions
  • The baseline exercise uses a four-dimension rubric (relevance, coherence, depth, fit) scored manually by three team members on a 1-5 scale, producing an immediately useful quality snapshot
  • Teams that create their first baseline discover their AI is 20-30% worse than they assumed - which is exactly the insight that unlocks improvement

Measuring AI product quality without a baseline starts with a simple 2-hour exercise: score 20 random interactions across four dimensions using three independent reviewers. Most teams delay this exercise for months because it feels like a massive infrastructure project, when in reality the first measurement is the single most valuable quality insight they will ever generate. This post covers the step-by-step process for creating your first baseline, interpreting the results, and building a monthly measurement cadence that drives improvement.

  • 2 hours to create your first quality baseline
  • 20 interactions to score
  • 3 independent scorers needed

Step 1: Pull 20 Random Interactions

True random from your database or API logs. Include user input, AI response, and available context.

Step 2: Recruit 3 Scorers

People NOT on the AI product team. Customer success reps, other teams, or target persona matches.

Step 3: Score on 4 Dimensions

Each scorer rates every interaction on relevance, coherence, depth, and fit (1-5 scale).

Step 4: Calculate Baseline

Average scores per dimension, convert to 0-1 scale. Identify your weakest dimension.

Step 5: Check Scorer Agreement

If average difference is more than 1.5 points, revise the rubric. Agreement within 1 point means your baseline is reliable.

Step 1: Pull 20 Random Interactions

Not your best interactions. Not your worst. Not interactions you selected because they represent typical usage. True random.

Go to your database, chat logs, or API records and pull 20 interactions from the last 30 days at random. If you have a user ID list, use a random number generator to pick 20. If you have a chronological log, pick every Nth interaction.

Randomness matters because your perception of typical is biased. You remember the good demos and the bad bugs. The random sample shows you the actual distribution.

Each interaction should include the user’s input, the AI’s response, and any available context about the user (role, experience level, what they were trying to accomplish).
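
If your interaction logs are accessible from application code, the sampling step is only a few lines. Here is a minimal sketch in TypeScript - the Interaction shape and field names are assumptions, so adapt them to whatever your database or API logs actually store:

sample-interactions.ts
interface Interaction {
  id: string;
  userInput: string;
  aiResponse: string;
  userContext?: { role?: string; experienceLevel?: string; goal?: string };
  createdAt: Date;
}

// Fisher-Yates shuffle, then take the first n - true random, not hand-picked
function sampleRandom(interactions: Interaction[], n = 20): Interaction[] {
  const pool = [...interactions];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}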

Step 2: Recruit Three Scorers

Find three people who are NOT on the AI product team. This is critical. Product team members have expertise that real users lack, and they will rate quality higher than users would.

Good scorers include: people from another team, customer success reps who talk to users, or even friends who match your user persona. The key is that they approach the interactions fresh, without the product team’s assumptions.

Give each scorer a one-page rubric guide with the four dimensions and scoring criteria.

Step 3: Score Each Interaction

Each scorer reads the interaction independently and scores it on four dimensions, each on a 1-5 scale:

Relevance (1-5): Did the AI respond to what the user was actually asking? This measures intent understanding. 1 = completely missed the point. 3 = got the topic right but not the specific need. 5 = precisely addressed the user's intent.

Coherence (1-5): Was the response logical and internally consistent? This measures output quality. 1 = contradictory or confusing. 3 = mostly clear with some rough spots. 5 = perfectly clear and well-structured.

Depth (1-5): Did the response go beyond the obvious? This measures insight generation. 1 = restated what the user already knew. 3 = provided some useful information. 5 = delivered genuine insight the user would not have found on their own.

Fit (1-5): Was the response appropriate for this specific user? This measures personalization. 1 = completely wrong level or style. 3 = somewhat appropriate. 5 = perfectly calibrated to the user's expertise and needs.

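If you want to capture the ratings as data rather than in a spreadsheet, a simple structure is enough. This is a sketch - the names are illustrative, not a prescribed schema:

score-sheet.ts
// One scorer's ratings for one interaction, each on the 1-5 rubric scale
interface DimensionScores {
  relevance: number; // did the AI address the user's actual intent?
  coherence: number; // was the response logical and internally consistent?
  depth: number;     // did it go beyond the obvious?
  fit: number;       // was it calibrated to this specific user?
}

interface ScoreSheet {
  interactionId: string;
  scorerId: string;
  scores: DimensionScores;
}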

Before the Baseline

  • Team thinks quality is about 8 out of 10
  • No data to support or refute the estimate
  • Cannot identify which dimension is weakest
  • Improvement efforts are unfocused
  • Board gets vague quality assurances

After the Baseline

  • Actual quality score: 3.1 out of 5 (6.2 out of 10)
  • Data from 20 interactions scored by 3 people
  • Fit is the weakest dimension at 2.4 out of 5
  • Clear priority: invest in personalization
  • Board gets a specific number with a plan

Step 4: Calculate and Interpret

For each interaction, average the three scorers’ ratings per dimension. Then average across all 20 interactions to get your dimension scores.

Convert to a 0-1 scale by dividing by 5. This gives you your first alignment-score-compatible baseline:

  • Relevance: average / 5
  • Coherence: average / 5
  • Depth: average / 5
  • Fit: average / 5
  • Overall alignment score: weighted average of the four
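
A minimal sketch of that arithmetic, assuming equal weights for the overall score - weight the dimensions differently if one matters more for your product. With the same three scorers on every interaction, averaging across all score sheets gives the same result as averaging per interaction first:

baseline-math.ts
const DIMENSIONS = ["relevance", "coherence", "depth", "fit"] as const;
type Dimension = (typeof DIMENSIONS)[number];

function calculateBaseline(sheets: { scores: Record<Dimension, number> }[]) {
  const result = {} as Record<Dimension | "overall", number>;
  for (const dim of DIMENSIONS) {
    const avg = sheets.reduce((sum, s) => sum + s.scores[dim], 0) / sheets.length;
    result[dim] = avg / 5; // 1-5 rubric average converted to a 0-1 scale
  }
  result.overall =
    DIMENSIONS.reduce((sum, dim) => sum + result[dim], 0) / DIMENSIONS.length;
  return result;
}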

The most common result when teams do this for the first time: relevance and coherence are decent (0.65-0.75), depth is moderate (0.55-0.65), and fit is low (0.40-0.55).

This pattern makes sense. Relevance and coherence are about the model: can it understand the question and produce a clear response? Depth and fit are about understanding the user: can it go beyond the generic and personalize? Without a self-model, depth and fit are always the weakest dimensions.

Step 5: Check Scorer Agreement

Before trusting the baseline, check whether your three scorers agreed. If they disagree significantly on the same interaction, the rubric needs calibration.

Calculate the average difference between scorers per interaction. If the average difference is more than 1.5 points on the 1-5 scale, your rubric descriptions are not clear enough. Revise and re-score.
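
One reasonable way to compute that check is the mean absolute gap between scorer pairs. The exact formula is up to you - what matters is applying the same threshold consistently:

// Mean absolute pairwise gap on one dimension of one interaction,
// e.g. scorerSpread([4, 3, 5]) is roughly 1.33 on the 1-5 scale
function scorerSpread(ratings: number[]): number {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < ratings.length; i++) {
    for (let j = i + 1; j < ratings.length; j++) {
      total += Math.abs(ratings[i] - ratings[j]);
      pairs++;
    }
  }
  return pairs === 0 ? 0 : total / pairs; // above 1.5: revise the rubric
}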

If scorers agree within 1 point on most interactions, your baseline is reliable. The agreement itself is a signal - it means quality is objectively visible, not just subjective opinion.

first-baseline.ts
// Calculate your first baseline (2 hours to this number)
const baseline = {
  sampleSize: 20,
  scorers: 3,
  dimensions: {
    relevance: 0.68, // decent - model understands the question
    coherence: 0.72, // decent - responses are clear
    depth: 0.58,     // moderate - often surface-level
    fit: 0.44,       // low - AI treats everyone the same
  },
  overall: 0.61,         // your first alignment score
  scorerAgreement: 0.82, // high - the rubric is working
};

// Now you know: fit is the priority investment

What to Do With Your Baseline

The baseline is not the destination. It is the starting line.

Month 1: Set Targets

Set quarterly improvement targets for each dimension. A reasonable target is 10-15% improvement per quarter on your weakest dimension.
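
As a worked example against the 0.44 fit score from the sample numbers above, a 10-15% relative improvement target comes out to roughly:

const fitBaseline = 0.44;                       // weakest dimension from the audit
const quarterlyTargetLow = fitBaseline * 1.10;  // about 0.48
const quarterlyTargetHigh = fitBaseline * 1.15; // about 0.51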

Monthly: Repeat the Audit

Run the same 20-interaction audit every month. Track scores over time. Monthly is frequent enough to catch problems and infrequent enough to show real change.

Month 3-4: Automate

Once you have done 3-4 manual baselines and understand what the scores mean, invest in automated measurement. The manual process teaches you what matters first.

Baseline Score | Interpretation | Priority Action
Below 0.40 | Critical - AI is actively harming the experience | Fix accuracy and relevance first
0.40 - 0.55 | Poor - users notice quality problems regularly | Focus on weakest dimension
0.55 - 0.70 | Acceptable - functional but not differentiated | Invest in depth and fit
0.70 - 0.85 | Good - competitive quality | Optimize and personalize
Above 0.85 | Excellent - quality is a competitive advantage | Maintain and protect
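
If you track the baseline programmatically, the bands above translate into a small helper. This is purely illustrative - adjust the thresholds and actions to your own table:

function interpretBaseline(score: number): string {
  if (score < 0.40) return "Critical - fix accuracy and relevance first";
  if (score < 0.55) return "Poor - focus on the weakest dimension";
  if (score < 0.70) return "Acceptable - invest in depth and fit";
  if (score < 0.85) return "Good - optimize and personalize";
  return "Excellent - maintain and protect";
}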

The Emotional Challenge

I am going to be honest about something. The hardest part of creating your first baseline is not the process. It is facing the number.

Every team I have guided through this exercise has discovered that their AI quality is lower than they assumed. The typical gap is 20-30% - teams that thought they were at 0.80 discover they are at 0.55-0.65.

This hurts. It feels like failure. But it is the opposite of failure. It is the beginning of improvement. You cannot fix a problem you cannot see. The baseline makes the problem visible, specific, and actionable.

The teams that create the baseline and face the number improve. The teams that avoid measurement stay stuck. Every time.

Trade-offs

Manual baselines do not scale. Scoring 20 interactions by hand works for a monthly check. It does not work for continuous monitoring. Plan to automate within 2-3 months of your first manual baseline.

Small samples have noise. Twenty interactions is enough for a directional baseline but not for statistical precision. Individual scores may be off. The dimensional pattern (which dimensions are strong vs. weak) is usually reliable even with small samples.

Scorer calibration takes time. First-time scorers are inconsistent. The first baseline is less reliable than the third baseline. Accept this and start anyway - an imperfect baseline is infinitely more useful than no baseline.

The baseline may create urgency you are not ready for. When the team sees a 0.44 fit score, they will want to fix it immediately. Make sure you have a realistic improvement plan before sharing the number widely.

What to Do Next

  1. Block 2 hours this week. Put it on the calendar. Pull 20 random interactions, recruit 3 scorers, and run the baseline exercise. You will have your first quality number before Friday.

  2. Share the number with your team. Do not hide it. Do not spin it. Show the four dimensions and identify the weakest. Ask the team what they think would improve the lowest dimension. The conversation will be more productive than any feature brainstorm.

  3. Set up monthly measurement. Commit to running the baseline exercise monthly. By month 3, you will have a trendline. By month 6, you will be ready to automate. See how Clarity automates continuous quality measurement.


The first measurement is always worse than you expect. That is exactly why it matters. Two hours. Twenty interactions. Your first quality baseline. Start measuring.


