How to Write an AI Product Spec That Works
Most AI product specs describe what the model should do. The best specs describe what the product should understand. Here is a framework for writing AI specs that survive contact with reality.
TL;DR
- Most AI specs describe model outputs when they should describe user understanding requirements and failure modes
- The best AI specs answer three questions: what does the product need to understand, how does it degrade gracefully, and how do we measure success
- A spec that survives production is shorter, more honest, and more focused on the messy cases than the happy path
An AI product spec that works is structured around five sections: understanding requirements, behavior specification, failure modes, graceful degradation, and evaluation criteria, with 60% of the content covering messy cases rather than the happy path. Most AI specs fail because they describe model outputs for ideal inputs while ignoring what happens when the model is wrong, the user is ambiguous, or the product knows nothing about who is asking. This post covers the five-section framework, why messy cases deserve more attention than happy paths, and the evaluation criteria that predict real-world success.
Why AI Specs Fail
Traditional software specs work because traditional software is deterministic. You specify the input, the transformation, and the output. If the code is correct, the output matches the spec. If it does not match, there is a bug.
AI products are probabilistic. The same input can produce different outputs. The quality of the output depends on context the spec did not anticipate. The model will confidently generate plausible nonsense. The user’s needs evolve between interactions.
When you write a traditional spec for a probabilistic system, you get a product that works in demos, where the inputs are controlled and the context is known, and fails in production, where neither is true.
The fix is not to write longer specs. It is to write specs that are structured for probabilistic systems.
Traditional AI Spec
- × 30 pages of model behavior descriptions
- × Example inputs and expected outputs
- × Happy path only, with edge cases deferred
- × Success measured by output accuracy on a test set

AI Spec That Works
- ✓ Understanding requirements before behavior specifications
- ✓ Failure modes and graceful degradation as first-class sections
- ✓ Messy cases get more detail than the happy path
- ✓ Success measured by user alignment over time
The Five-Section Framework
Every AI product spec should have five sections. Not ten. Not twenty. Five. Each section answers a specific question that probabilistic systems demand.
Section 1: Understanding Requirements. What does the product need to understand about the user to do this job well? This is your belief inventory. List the 3 to 7 things the product must know about each user, how those things are learned, and how confident the product needs to be before acting on them.
Section 2: Behavior Specification. Given the understanding above, what does the product do? This is the traditional spec section, but it is conditional on understanding. Instead of writing “the model should respond with a concise answer,” you write “when the product believes the user prefers concise responses with confidence above 0.7, respond concisely; otherwise, provide moderate detail.”
Section 3: Failure Modes. What happens when things go wrong? List every way the feature can fail: wrong output, ambiguous input, low confidence, out-of-domain requests, contradictory context. Specify the product’s behavior for each. This section should be longer than Section 2.
Section 4: Graceful Degradation. What is the minimum viable experience? When the product has no understanding, low confidence, or encounters an unknown situation, what does it do? This is your cold start, your fallback, your safety net. It should be good enough to retain the user while the product learns.
Section 5: Evaluation Criteria. How do you know if this is working? Specify metrics that connect model performance to user outcomes: not just accuracy on a test set, but alignment over time, confidence calibration, and failure mode coverage.
```javascript
// Section 1: Understanding Requirements (what the product needs to know)
const understanding = {
  beliefs: [
    { name: 'response_style', threshold: 0.7, source: 'behavioral' },
    { name: 'expertise_level', threshold: 0.6, source: 'onboarding + behavioral' },
    { name: 'primary_use_case', threshold: 0.8, source: 'direct question' },
  ],
};

// Section 3: Failure Modes (longer than the happy path)
const failureModes = {
  low_confidence: 'Ask clarifying question before responding',
  out_of_domain: 'Acknowledge limitation, suggest alternative',
  contradictory_context: 'Surface the contradiction to user',
  model_error: 'Fallback to template response + flag for review',
};

// Section 4: Graceful Degradation (the minimum viable experience)
const coldStart = {
  no_beliefs: 'Use sensible defaults + ask 2 bootstrap questions',
  partial_beliefs: 'Use known beliefs, default on unknown',
  confidence_below_threshold: 'Hedge response + gather signal',
};
```
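The skeleton above covers Sections 1, 3, and 4. Sections 2 and 5 can be sketched the same way. The following is a minimal illustration, not a prescribed API: the `beliefs` shape, `chooseResponseStyle`, and the evaluation targets are all hypothetical names.

```javascript
// Section 2: Behavior Specification (conditional on understanding).
// `beliefs` maps belief names to { value, confidence } pairs, mirroring
// the skeleton above. Names and thresholds are illustrative.
function chooseResponseStyle(beliefs) {
  const style = beliefs.response_style;

  // Act on a belief only above its confidence threshold (0.7 for style).
  if (style && style.confidence > 0.7) {
    return style.value; // e.g. 'concise'
  }

  // Below threshold: fall back to the Section 4 degradation path.
  return 'moderate_detail';
}

console.log(chooseResponseStyle({ response_style: { value: 'concise', confidence: 0.85 } })); // 'concise'
console.log(chooseResponseStyle({ response_style: { value: 'concise', confidence: 0.40 } })); // 'moderate_detail'

// Section 5: Evaluation Criteria (targets tying model performance to
// user outcomes). Metric names and targets are illustrative.
const evaluation = {
  alignment_score_trend: 'improving per user, week over week',
  confidence_calibration_gap: 'under 0.1 between stated confidence and accuracy',
  failure_mode_coverage: 'at least 80% of enumerated failure modes handled gracefully',
};
```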
The Messy Cases Principle
Here is the counterintuitive rule: spend 60 percent of your spec on the messy cases and 40 percent on the happy path.
The happy path is easy to describe because it is what you imagine. The messy cases are hard to describe because they are what actually happens. Users ask ambiguous questions. Context is incomplete. The model misinterprets intent. The user contradicts themselves. The session is their first interaction and the product knows nothing about them.
These messy cases are where products differentiate. Every AI product handles the happy path well enough. The product that handles the messy cases gracefully, that says “I am not sure what you mean, can you clarify?” instead of confidently generating garbage, is the product that earns trust.
Earn trust once, and the user forgives future mistakes. Lose trust once with a confidently wrong answer, and recovery takes ten correct answers.
| Spec Section | Traditional Allocation | Recommended Allocation |
|---|---|---|
| Understanding requirements | 0 percent | 20 percent |
| Happy path behavior | 70 percent | 20 percent |
| Failure modes | 10 percent | 30 percent |
| Graceful degradation | 5 percent | 15 percent |
| Evaluation criteria | 15 percent | 15 percent |
The Evaluation Trap
The most common evaluation mistake in AI specs is specifying test set accuracy as the success metric. The model is 92 percent accurate on the test set. Ship it.
The problem is that test set accuracy measures the happy path. It does not measure how the product handles ambiguity, how it improves per user over time, or whether users trust it. A product that is 92 percent accurate but confidently wrong the other 8 percent will lose users faster than a product that is 85 percent accurate but says “I am not sure” when it does not know.
Better evaluation criteria for AI product specs (a sketch for measuring calibration follows this list):
- Alignment score trend: is the product getting better at understanding each user over time?
- Confidence calibration: when the product is confident, is it correct? When it is uncertain, does it say so?
- Failure mode coverage: what percentage of messy cases are handled gracefully?
- Recovery rate: after a bad interaction, does the user try again or leave?
- Time to value: how many interactions before the product feels personalized?
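Confidence calibration is the most mechanical of these to measure. Here is a minimal sketch, assuming your logs record each interaction’s stated confidence and whether the outcome was correct (that data shape is an assumption, not a given): bucket interactions by confidence, then compare each bucket’s mean confidence to its accuracy.

```javascript
// Confidence calibration: in each confidence bucket, does stated confidence
// match observed accuracy? Input: [{ confidence: 0-1, correct: boolean }].
function calibrationReport(interactions, numBuckets = 5) {
  const buckets = Array.from({ length: numBuckets }, () => ({ n: 0, conf: 0, correct: 0 }));
  for (const { confidence, correct } of interactions) {
    const i = Math.min(Math.floor(confidence * numBuckets), numBuckets - 1);
    buckets[i].n += 1;
    buckets[i].conf += confidence;
    buckets[i].correct += correct ? 1 : 0;
  }
  return buckets.map((b, i) => ({
    bucket: `${(i / numBuckets).toFixed(1)}-${((i + 1) / numBuckets).toFixed(1)}`,
    meanConfidence: b.n ? b.conf / b.n : null,
    accuracy: b.n ? b.correct / b.n : null,
    gap: b.n ? Math.abs(b.conf / b.n - b.correct / b.n) : null, // large gap = miscalibrated
  }));
}
```

A large gap in any bucket means the product’s stated confidence is lying to users in that range, which is exactly the trust-destroying behavior described above.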
Trade-offs
Writing specs this way requires cultural change. Here is what to expect:
Longer planning phase. Understanding requirements and failure modes take time to enumerate. Plan for an extra 2 to 3 days of spec writing compared to traditional specs. The payoff is fewer surprises in production.
Engineering pushback. Engineers are used to specs that describe desired outputs, not desired understanding. The shift requires new thinking about how to implement features. Plan for a spec review meeting that includes engineering from the start.
Discomfort with uncertainty. Traditional specs promise specific outcomes. AI specs that honestly enumerate failure modes and degradation paths feel less certain. This discomfort is productive: it means you are being honest about what probabilistic systems can and cannot guarantee.
Harder to estimate. When 60 percent of the spec is messy cases, engineering estimates become harder because messy cases are genuinely harder to implement. Use the failure mode list as an explicit backlog and prioritize which ones to handle in V1 versus V2.
Spec maintenance. As the product learns and failure modes change, the spec needs updating. Build spec review into your sprint retrospective rather than treating it as a write-once artifact.
What to Do Next
1. Audit your last AI spec. Open the most recent AI feature spec your team wrote. Count the words spent on happy path versus messy cases. If the ratio is above 3 to 1, your spec was optimized for demos, not production.
2. Write an understanding requirements section. For the feature you are speccing now, list the 3 to 5 things the product needs to understand about each user. For each, specify the confidence threshold and learning mechanism. This single section will change how your team thinks about the feature.
3. Enumerate failure modes before you write the happy path. Start your next spec with Section 3: Failure Modes. List every way the feature can go wrong. Then write the happy path as the case where none of those failures occur. You will be surprised how much clearer the happy path becomes when you define it by exclusion (a sketch of this pattern follows below).
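One way to encode step 3 directly is to run every failure-mode check before the happy path, so the happy path is literally defined by exclusion. A sketch under stated assumptions: the `ctx` shape, the checks, and the stub handlers are all hypothetical.

```javascript
// Happy path by exclusion: the happy path is whatever runs when no
// failure mode fires. Handlers and ctx fields are illustrative stubs.
const askClarifyingQuestion = () => 'I am not sure what you mean. Can you clarify?';
const acknowledgeLimitation = () => 'That is outside what I can help with.';
const surfaceContradiction  = () => 'You told me A earlier but B now. Which should I use?';
const happyPath             = (ctx) => `Answering: ${ctx.query}`;

const failureChecks = [
  { name: 'low_confidence',        applies: (ctx) => ctx.confidence < 0.7,                    handle: askClarifyingQuestion },
  { name: 'out_of_domain',         applies: (ctx) => !ctx.inDomain,                           handle: acknowledgeLimitation },
  { name: 'contradictory_context', applies: (ctx) => (ctx.contradictions || []).length > 0,   handle: surfaceContradiction },
];

function respond(ctx) {
  for (const check of failureChecks) {
    if (check.applies(ctx)) return check.handle(ctx); // messy cases first
  }
  return happyPath(ctx); // the happy path, defined by exclusion
}

console.log(respond({ query: 'refund policy', confidence: 0.9, inDomain: true })); // 'Answering: refund policy'
console.log(respond({ query: 'refund policy', confidence: 0.4, inDomain: true })); // clarifying question
```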
Stop writing specs for demos. Start writing specs for reality. Build AI products that handle the messy cases gracefully.