The AI Product Debug Loop

Hamel Husain's insight that eval infrastructure IS debug infrastructure. How to build a system that catches AI product problems before users do, using the same feedback loop that makes software debugging work.

Robert Ta · CEO & Co-Founder · 7 min read

TL;DR

  • Eval infrastructure and debug infrastructure are the same thing: teams that build them separately waste effort on two half-systems instead of one complete system
  • The AI product debug loop follows four steps: observe the quality signal, hypothesize the failure layer, test the hypothesis, and fix the specific cause. The observation step only works if it captures user context
  • Adding self-models to the debug loop lets you distinguish between model failures and personalization failures, which is the diagnosis most teams miss

The AI product debug loop integrates evaluation and debugging into a single system that both detects quality problems and diagnoses their root cause, following the insight that eval infrastructure IS debug infrastructure. Teams that build separate eval and debug systems waste effort on two half-systems where quality drops take days to diagnose instead of hours. This post covers the four-step debug loop (observe, hypothesize, test, fix), the five failure layers to check, and how self-models enable user-context debugging that distinguishes model failures from personalization failures.

4 steps in the debug loop
5 failure layers to check
1 system for eval and debug (not two)

The Four-Step Debug Loop

The AI product debug loop mirrors the scientific method applied to quality problems: observe, hypothesize, test, fix.

Step 1: Observe. Something triggered the debug loop. Maybe an eval dimension dropped. Maybe a user reported a bad experience. Maybe a safety assertion failed. The observation includes the input, the output, the eval scores, and, critically, the user context. Without user context, you cannot tell whether the problem is a model failure or a personalization failure.

Step 2: Hypothesize. Which failure layer caused the problem? There are five: the model itself (capability issue), the prompt (instruction issue), the retrieval (context issue), the user model (understanding issue), or the integration (pipeline issue). The eval scores help narrow it down. Low accuracy suggests a model or retrieval failure. Low fit suggests a user model failure. Low coherence suggests a pipeline or prompt failure.

Step 3: Test. Isolate the suspected layer and test it. If you suspect a retrieval failure, check what documents were retrieved and whether they were relevant. If you suspect a user model failure, check whether the self-model accurately reflected the user’s context. If you suspect a prompt failure, test the same input with a modified prompt.

Step 4: Fix. Apply the fix to the specific layer. Update the prompt, improve the retrieval query, correct the self-model, or escalate to a model-level issue. Then verify the fix against the original failing case and your broader eval suite.
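
Put together, the loop is small enough to sketch in code. The TypeScript below is a minimal sketch, not a real API: every name in it (runDebugLoop, suggestLayer, testLayer, applyFix, verifyFix) is illustrative, and the score thresholds are placeholders.

debug-loop-sketch.ts
type Layer = 'model' | 'prompt' | 'retrieval' | 'user_model' | 'integration';

interface Observation {
  input: string;
  output: string;
  evalScores: { accuracy: number; relevance: number; fit: number; coherence: number };
  userContext: Record<string, unknown>; // self-model snapshot at request time
}

// Step 2: hypothesize. Map the score pattern to a suspected layer
// (a fuller attribution sketch appears later in the post).
function suggestLayer(s: Observation['evalScores']): Layer {
  if (s.fit < 0.6 && s.accuracy >= 0.8) return 'user_model';
  if (s.accuracy < 0.6) return 'retrieval';    // could also be 'model'; testing distinguishes them
  if (s.coherence < 0.6) return 'integration'; // could also be 'prompt'
  return 'prompt';                             // default: cheapest layer to test first
}

// Steps 3 and 4 are stubs here: isolate the layer, re-run the failing
// case, then verify against that case plus the broader eval suite.
async function testLayer(layer: Layer, obs: Observation): Promise<boolean> { return true; }
async function applyFix(layer: Layer, obs: Observation): Promise<void> {}
async function verifyFix(obs: Observation): Promise<void> {}

// Step 1 (observe) produced `obs`; the rest of the loop follows.
async function runDebugLoop(obs: Observation): Promise<void> {
  const suspected = suggestLayer(obs.evalScores); // Step 2: hypothesize
  if (await testLayer(suspected, obs)) {          // Step 3: test
    await applyFix(suspected, obs);               // Step 4: fix...
    await verifyFix(obs);                         // ...and verify it holds
  }
}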


Separate Eval and Debug Systems

  • × Eval system detects quality dropped 5%
  • × Debug team starts from scratch: what broke?
  • × Days of investigation to find root cause
  • × No connection between the detection and the diagnosis
  • × Same failure happens again next month

Integrated Debug Loop

  • Eval system detects quality dropped 5%
  • Same system shows which dimension, which users, which inputs
  • Root cause identified in hours, not days
  • Detection and diagnosis share the same data
  • Fix becomes a new assertion preventing recurrence

The Five Failure Layers

Every AI quality problem lives in one of five layers. The debug loop works by systematically checking each layer until you find the cause.

Layer 1: Model Capability. The model cannot do what you are asking. No amount of prompting or context will fix it. This is the rarest failure mode but the hardest to resolve: it requires model upgrades or fine-tuning.

Layer 2: Prompt Instructions. The model can do it, but the prompt does not ask correctly. Ambiguous instructions, missing constraints, or conflicting directives cause the model to behave unpredictably. This is the most common failure mode and the easiest to fix.

Layer 3: Retrieval Context. The prompt is correct, but the AI received the wrong context. The RAG pipeline retrieved irrelevant documents, or retrieved the right documents but the wrong sections. The output is faithful to the context, but the context was wrong.

Layer 4: User Understanding. The retrieval is correct, the prompt is correct, but the AI does not understand who it is talking to. It gives an expert-level answer to a beginner, or a beginner-level answer to an expert. The content is right but the presentation is wrong for this person.

Layer 5: Integration Pipeline. Everything works in isolation but fails in combination. Latency issues cause context to arrive after the response starts generating. Caching serves stale personalization. Rate limiting degrades quality under load.
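
One lightweight way to operationalize the layers, as a sketch, is an ordered checklist the debug loop walks through. The question wording below is distilled from the descriptions above, not from any standard taxonomy.

failure-layers.ts
// The five layers as an ordered diagnostic checklist.
const FAILURE_LAYERS = [
  { id: 1, layer: 'model_capability',    question: 'Can the model do this at all?' },
  { id: 2, layer: 'prompt_instructions', question: 'Does the prompt ask for it correctly?' },
  { id: 3, layer: 'retrieval_context',   question: 'Did the right context reach the model?' },
  { id: 4, layer: 'user_understanding',  question: 'Is the answer pitched for this user?' },
  { id: 5, layer: 'integration',         question: 'Does it still hold together under real load?' },
] as const;

// Walk the layers in order; the first check that fails is the suspect.
for (const { id, layer, question } of FAILURE_LAYERS) {
  console.log(`Layer ${id} (${layer}): ${question}`);
}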


debug-loop.ts
// The integrated debug loop: eval system IS debug system
const debugReport = await clarity.debug({
  interaction: failingInteraction,
  evalScores: { accuracy: 0.92, relevance: 0.85, fit: 0.45, coherence: 0.88 },
  userContext: await clarity.getSelfModel(userId),
});

// Automatic layer diagnosis: low fit + good accuracy = Layer 4
// debugReport: {
//   suspectedLayer: 'user_understanding',
//   evidence: 'Self-model shows expertise=beginner, but response used
//     advanced terminology. User model may be stale.',
//   suggestedFix: 'Refresh self-model from recent interactions',
//   relatedFailures: 12, // Same pattern in 12 other interactions
// }

Building the Integrated System

Here is how to combine eval and debug infrastructure into one system.

Shared data layer. Every interaction is logged with its input, output, eval scores, user context, retrieved documents, and prompt. This single log is used for both aggregate quality monitoring (eval) and individual failure investigation (debug).

Dimension-level scoring. Score each interaction on multiple dimensions (accuracy, relevance, fit, coherence), not just a single quality score. This dimensional breakdown is the diagnostic information that connects eval detection to debug investigation.
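
One possible shape for that shared record, with the dimension-level scores built in (the field names are assumptions, not a fixed schema):

interaction-log.ts
// One record serves both eval monitoring and debug investigation.
interface InteractionLog {
  id: string;
  timestamp: string;          // ISO 8601
  input: string;              // what the user asked
  output: string;             // what the AI answered
  prompt: string;             // the fully rendered prompt that was sent
  retrievedDocs: { docId: string; section: string; relevance: number }[];
  userContext: Record<string, unknown>; // self-model snapshot at request time
  evalScores: {               // dimension-level, not one quality number
    accuracy: number;
    relevance: number;
    fit: number;
    coherence: number;
  };
}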

Automatic layer attribution. When a dimension scores low, automatically suggest which failure layer is the likely cause. Low accuracy with high relevance suggests Layer 1 (model capability) or Layer 3 (retrieval). Low fit with high accuracy suggests Layer 4 (user understanding). These attributions accelerate debugging from days to hours.
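
As a sketch, attribution can start as a handful of rules over the dimension scores. The 0.6 threshold is a placeholder to tune against your own data, and multiple suspects are returned when the scores alone cannot distinguish layers:

layer-attribution.ts
type Scores = { accuracy: number; relevance: number; fit: number; coherence: number };

// Heuristic attribution: map a score pattern to suspected failure layers.
function attributeLayers(s: Scores, threshold = 0.6): string[] {
  const suspects: string[] = [];
  if (s.accuracy < threshold && s.relevance >= threshold)
    suspects.push('model_capability', 'retrieval_context'); // Layer 1 or 3
  if (s.fit < threshold && s.accuracy >= threshold)
    suspects.push('user_understanding');                    // Layer 4
  if (s.coherence < threshold)
    suspects.push('prompt_instructions', 'integration');    // Layer 2 or 5
  // No dimension flagged: start with the cheapest layer to test.
  return suspects.length > 0 ? suspects : ['prompt_instructions'];
}

// The scores from the debug-loop.ts example point straight at Layer 4:
attributeLayers({ accuracy: 0.92, relevance: 0.85, fit: 0.45, coherence: 0.88 });
// → ['user_understanding']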

Regression-to-assertion pipeline. When you fix a bug, the failing case becomes a new assertion in your eval suite. This creates a flywheel where every debug session strengthens your evaluation. Over time, your eval suite becomes a comprehensive map of every failure mode your product has ever encountered.
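
The pipeline itself can stay small. A hedged sketch: when a case is fixed, freeze it as a named assertion with a per-dimension score floor (the 0.7 floor and the field names are assumptions):

regression-to-assertion.ts
type Dimension = 'accuracy' | 'relevance' | 'fit' | 'coherence';

interface Assertion {
  name: string;
  input: string;                          // the once-failing input, frozen
  minScores: Partial<Record<Dimension, number>>;
  sourceBug: string;                      // link back to the debug session
}

// Promote a fixed failing case into a permanent eval-suite assertion.
function promoteToAssertion(
  log: { id: string; input: string },
  failedDimension: Dimension,
  floor = 0.7,
): Assertion {
  const minScores: Partial<Record<Dimension, number>> = {};
  minScores[failedDimension] = floor;
  return {
    name: `regression-${log.id}-${failedDimension}`,
    input: log.input,
    minScores,
    sourceBug: log.id,
  };
}

Pair this with the curation advice in the trade-offs below: promote patterns, not every individual incident.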


The User Context Advantage

The debug loop hits a wall without user context. Here is why.

When the AI gives a technically correct but unsatisfying response, the failure could be in the prompt (Layer 2), the retrieval (Layer 3), or the user understanding (Layer 4). Without knowing the user, you cannot distinguish between these layers.

Was the response too technical? You need to know the user’s expertise level. Did the response ignore the user’s goals? You need to know what they were trying to accomplish. Did the response contradict previous advice? You need to know the interaction history.

Self-models provide this context to the debug loop. They transform a vague quality signal into a specific diagnosis. Instead of "the response was not great," you get "the response used advanced terminology for a user whose self-model indicates beginner-level expertise." That is a Layer 4 failure.
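
A sketch of that diagnosis in code: the Layer 4 check compares the self-model against a property of the response. The expertiseLevel field and the readingLevel heuristic below are hypothetical stand-ins for whatever your self-model and text analysis actually provide.

fit-check.ts
interface SelfModel {
  expertiseLevel: 'beginner' | 'intermediate' | 'expert';
}

// Hypothetical heuristic: estimate how advanced the response's language is.
function readingLevel(text: string): 'beginner' | 'intermediate' | 'expert' {
  const jargon = /\b(idempotent|amortized|asymptotic|heteroscedastic)\b/gi;
  const hits = (text.match(jargon) ?? []).length;
  return hits > 2 ? 'expert' : hits > 0 ? 'intermediate' : 'beginner';
}

// A fit failure is a mismatch between audience and delivery,
// even when the content itself is accurate.
function isLayer4Failure(selfModel: SelfModel, response: string): boolean {
  return selfModel.expertiseLevel === 'beginner'
    && readingLevel(response) !== 'beginner';
}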

This diagnostic precision is the difference between spending three days investigating a quality drop and resolving it in three hours.

Debug Without User Context vs. Debug With User Context

  • "Quality dropped but we do not know why" → "Quality dropped for beginner users on technical topics"
  • "The response seems okay to me" → "The response is good for experts but bad for this user's level"
  • "Could be prompt, retrieval, or personalization" → "Self-model shows a stale expertise assessment: Layer 4"
  • "Investigation takes days" → "Diagnosis takes hours"

Trade-offs

Integrated systems are more complex to build initially. Combining eval and debug requires careful data modeling and infrastructure decisions. The upfront investment is higher than building two simple systems, but the long-term maintenance cost is lower.

Automatic layer attribution is imperfect. The system can suggest which layer likely caused the failure, but human judgment is still needed to confirm. Treat attributions as starting hypotheses, not definitive diagnoses.

The regression-to-assertion pipeline needs curation. Not every bug fix deserves a permanent assertion. If you add every failing case, your assertion suite becomes bloated and slow. Curate assertions to represent patterns, not individual incidents.

User context adds privacy considerations. Including user information in debug logs raises data protection concerns. Implement appropriate access controls and anonymization for debug data.

What to Do Next

  1. Audit whether your eval and debug systems share data. If they are separate, identify the five pieces of information that both need: input, output, eval scores, user context, and retrieved documents. Consolidate them into a shared log.

  2. Add dimension-level scoring to your eval suite. If you currently produce a single quality score, decompose it into accuracy, relevance, fit, and coherence. This dimensional breakdown is what enables automatic layer attribution during debugging. Hamel Husain’s eval framework [1] provides practical guidance on building these dimensions.

  3. Start the regression-to-assertion pipeline. The next time you debug a quality issue, turn the failing case into a new assertion in your eval suite. After 30 debug sessions, your eval suite will be dramatically more comprehensive. Clarity provides the self-model layer that makes user-context debugging possible.


Your eval system should diagnose problems, not just detect them. Build the debug loop.

References

  1. Hamel Husain’s eval framework
  2. 2016 survey of 2,000 Americans by Reelgood and Learndipity Data Insights
  3. Product vs. Feature Teams
  4. only 1 in 26 unhappy customers actually complains
  5. Scientific American explains
  6. cold start problem
