
Minimum Viable Evals: Start Here

You do not need MLflow or LangSmith on day one. You need a spreadsheet and 50 test cases. How to build your first AI evaluation system in a single afternoon using Hamel Husain's Level 1 approach.

Robert Ta · CEO & Co-Founder · 8 min read

TL;DR

  • You do not need MLflow, LangSmith, or any platform to start measuring AI quality - 50 test cases in a spreadsheet will catch the majority of regressions on day one
  • Hamel Husain’s Level 1 eval approach is the starting point: simple input/output assertions that run on every deployment before you invest in anything more complex
  • The gap between zero evals and minimum viable evals is the most important quality leap your AI product will ever make - everything else is incremental

Minimum viable evals require nothing more than 50 test cases and a single afternoon to build, catching 80% of quality regressions without any evaluation platform. Most AI teams ship for months without evals because the tooling landscape makes evaluation feel like a massive infrastructure decision. This post covers how to build your first eval suite from scratch using Hamel Husain’s Level 1 approach, why 50 test cases is enough to start, and how to evolve from basic assertions to self-model-aware evaluation.

  • 50 test cases is all you need to start
  • 80% of regressions caught by basic assertions
  • 1 afternoon to build your first eval suite
  • 0 platforms required on day one

The Minimum Viable Eval

Here is what minimum viable evals look like. It is embarrassingly simple. That is the point.

Step 1: Collect 50 real inputs. Go to your production logs. Pull 50 actual user queries or inputs. Not synthetic - real. These should span your common use cases, a few edge cases, and at least five inputs where you know the AI currently struggles.

Step 2: Write expected outputs. For each input, write down what a good response looks like. Not the exact wording - the criteria. For a customer support AI: does it correctly identify the issue? Does it provide the right resolution? Does it avoid hallucinating a policy that does not exist? For a search AI: does it return relevant results? Does it handle ambiguous queries gracefully?

Step 3: Run and score. Pass each input through your AI. Compare the output to your expected criteria. Score it pass/fail. If you want more nuance, score each criterion on a 1-3 scale.
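If you want the 1-3 variant, here is a minimal sketch of what a scored record could look like - the field names are illustrative, not a required schema:

// One scored test case: each criterion gets 1 (fail), 2 (partial), or 3 (pass).
type CriterionScore = 1 | 2 | 3;

interface ScoredCase {
  input: string;
  scores: Record<string, CriterionScore>; // keyed by criterion
  passed: boolean; // pass only if every criterion scores a 3
}

const scored: ScoredCase = {
  input: 'How do I reset my password?',
  scores: { 'mentions settings page': 3, 'includes link': 2, 'no hallucinated steps': 3 },
  passed: false,
};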

Step 4: Automate the run. Write a script that runs all 50 test cases and reports the pass rate. Hook it into your CI/CD pipeline. Now every deployment gets a quality gate.
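A minimal sketch of that gate, assuming a hypothetical runSuite helper that executes all 50 cases and returns their pass/fail results - the 90% threshold is arbitrary and yours to set:

// quality-gate.ts - block the deployment if the pass rate drops below the bar.
interface EvalResult { input: string; passed: boolean; }
declare function runSuite(): Promise<EvalResult[]>; // hypothetical: runs all 50 test cases

const results = await runSuite();
const passRate = results.filter(r => r.passed).length / results.length;
console.log(`Eval pass rate: ${(passRate * 100).toFixed(0)}%`);

if (passRate < 0.9) {
  process.exit(1); // non-zero exit fails the CI job and stops the deploy
}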

That is it. You now have evals. They are not fancy. They are not comprehensive. But they catch regressions that would otherwise reach users, and they give your team a shared understanding of what good means.


No Evals

  • Ship and pray
  • Find regressions when users complain
  • No shared definition of quality
  • Prompt changes are YOLO
  • Month 9 without measurement

Minimum Viable Evals

  • Ship with confidence on known cases
  • Catch regressions before deployment
  • 50 documented quality expectations
  • Prompt changes tested automatically
  • Built in a single afternoon

Why 50 Test Cases Is Enough to Start

Fifty sounds arbitrary. It is not. Here is the math.

If your AI product handles 10 common use cases, 50 test cases gives you 5 examples per use case. That is enough to catch catastrophic regressions - the cases where a prompt change breaks an entire category of responses.

You are not trying to prove statistical significance. You are trying to catch the obvious failures that currently ship to production undetected. And those failures are obvious. When a prompt change causes your AI to hallucinate company policies, you do not need 500 test cases to detect it. You need one.

As Hamel Husain argues in his field guide to AI products [1], the value of evals is not just catching bugs - it is building an understanding of your AI’s behavior patterns. Those 50 test cases are the beginning of institutional knowledge about what your AI does well and where it struggles.

The Three Levels: Where MVE Fits

Hamel Husain’s eval framework [2] describes three levels of evaluation maturity:

Level 1 - Assertions. Automated checks that verify specific behaviors. Does the AI refuse to answer out-of-scope questions? Does it include a disclaimer when giving medical information? Does it cite sources when asked? These are the building blocks of your 50-test-case suite.

Level 2 - Human Review. Domain experts reviewing a sample of outputs against defined criteria. This catches the subtle failures that assertions miss: technically correct but unhelpful responses, appropriate but tone-deaf answers, accurate but unnecessarily complex explanations.

Level 3 - System Metrics. Aggregate measurements of quality over time. Alignment scores, user satisfaction trends, dimension-level tracking. This is where you understand whether your AI is getting better or worse at serving users.

Minimum viable evals are Level 1. They are the foundation. You cannot skip to Level 2 or 3 without them - because without assertions, you do not know what your AI does consistently enough to review or measure.
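In practice, a Level 1 assertion can be as blunt as a string or regex check. A minimal sketch - the refusal phrasing and disclaimer text below are placeholders for whatever your product actually says:

// Two example assertions: out-of-scope refusal and medical disclaimer.
const refusesOutOfScope = (response: string): boolean =>
  /can't help with that|outside the scope of/i.test(response);

const includesMedicalDisclaimer = (response: string): boolean =>
  response.toLowerCase().includes('not a substitute for medical advice');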


Building Your First 50 Test Cases

Here is the framework I use with every team I work with.

20 common cases. These are your happy-path scenarios. The inputs that represent 80% of your traffic. If your AI product is a customer support agent, these are the 20 most common questions. If it is a search engine, these are the 20 most common queries.

15 edge cases. These are the inputs where context matters. Ambiguous queries. Multi-intent inputs. Questions that require reasoning across multiple pieces of information. These test whether your AI handles complexity gracefully.

10 failure cases. These are inputs where your AI currently fails. You know about these - users have reported them, or your team has noticed them during demos. Include them explicitly so you catch regressions in areas you have already tried to fix.

5 adversarial cases. These are inputs designed to break your AI. Prompt injection attempts. Out-of-scope requests. Contradictory instructions. These ensure your AI fails safely.

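Put together, the suite is small enough to read in one sitting. A minimal sketch follows - ai.generate and evaluateCriterion stand in for your own model client and scoring helper: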

minimum-viable-eval.ts

// Your first eval suite - no platform required: 50 cases, one afternoon
const testCases = [
  { input: 'How do I reset my password?',
    criteria: ['mentions settings page', 'includes link', 'no hallucinated steps'],
    category: 'common' },
  { input: 'I want to cancel but also upgrade',
    criteria: ['asks for clarification', 'does not assume intent'],
    category: 'edge' },
];

// Run each case against your AI and score it - automate this in CI/CD
const results: any[] = [];
for (const tc of testCases) {
  const response = await ai.generate(tc.input);
  const passed = tc.criteria.every(c => evaluateCriterion(response, c));
  results.push({ ...tc, passed, response });
}
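The evaluateCriterion call above is deliberately left open. For objective criteria, a keyword or regex check is often enough to start; this is a naive placeholder, not a real scoring library, and anything it cannot match should go to human review:

// Naive criterion check: map each criterion string to a pattern to find in the response.
function evaluateCriterion(response: string, criterion: string): boolean {
  const patterns: Record<string, RegExp> = {
    'mentions settings page': /settings/i,
    'includes link': /https?:\/\//i,
    'asks for clarification': /clarify|which one|can you tell me more/i,
  };
  const pattern = patterns[criterion];
  // Unmapped criteria (e.g. 'no hallucinated steps') fail here and get flagged for manual review.
  return pattern ? pattern.test(response) : false;
}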

From MVE to Self-Model-Aware Evals

Minimum viable evals have a ceiling. They test whether the AI handles inputs correctly for a generic user. They do not test whether the AI handles inputs correctly for a specific user.

Consider this test case: the input is "recommend a project management approach." The expected output depends entirely on who is asking. A solo founder needs something different from a VP of Engineering at a 500-person company. A technical audience expects different depth than a non-technical one.

This is where the eval system needs to know the user. And that is where self-models extend Hamel’s framework.

A self-model captures the user’s beliefs, expertise, goals, and context. When your eval system incorporates self-models, the test case becomes: given this user’s context, did the AI produce a relevant, coherent, deep, and personalized response? The same input can have different correct answers depending on the user.

You do not need self-models on day one. But you should know that your MVE suite will eventually hit this ceiling - and plan for it.
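As a sketch of where that extension leads, here is one hypothetical shape for a self-model-aware test case. The userContext field and its contents are illustrative only, not a real schema from the Clarity API:

// Same input, different correct answers: the expected criteria depend on who is asking.
const personalizedCases = [
  { input: 'Recommend a project management approach',
    userContext: { role: 'solo founder', expertise: 'non-technical', teamSize: 1 },
    criteria: ['lightweight process', 'no enterprise tooling', 'plain-language explanation'] },
  { input: 'Recommend a project management approach',
    userContext: { role: 'VP of Engineering', expertise: 'technical', teamSize: 500 },
    criteria: ['addresses cross-team coordination', 'assumes technical fluency'] },
];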

| Eval Maturity | Infrastructure Required | What It Catches | What It Misses |
| --- | --- | --- | --- |
| No evals | Nothing | Nothing | Everything |
| 50 test cases | A script and a spreadsheet | Catastrophic regressions | Subtle quality issues |
| Human review protocol | Reviewer time plus criteria docs | Nuanced quality gaps | User-specific failures |
| Self-model-aware evals | Clarity API plus user context | Personalization failures | Nothing significant |

Trade-offs

Minimum viable evals give false confidence. Passing 50 test cases does not mean your AI is good. It means your AI is not broken on those specific inputs. The gap between passing your test suite and satisfying users is real - but it is smaller than the gap between zero evals and 50 test cases.

Test case maintenance is ongoing. As your product evolves, your test cases become stale. A test case written six months ago might test a feature that no longer exists. Plan to review and update your suite quarterly.

Automated scoring has limits. Simple criteria like "mentions settings page" are easy to automate. Criteria like "provides a helpful response" are not. You will need human review (Level 2) for subjective quality dimensions.

Starting simple can feel embarrassing. When competitors talk about their sophisticated eval platforms, a spreadsheet with 50 test cases feels inadequate. It is not. It is the foundation that makes everything else possible.

What to Do Next

  1. Block two hours this week. Pull 50 real user inputs from production logs. Write expected criteria for each. Score your AI against them by hand. You will learn more about your product quality in those two hours than in any dashboard session.

  2. Automate the run. Take your 50 test cases and write a script that executes them automatically. Add it to your CI/CD pipeline so every deployment runs the suite. Hamel Husain’s eval framework [3] has practical examples of how to structure this.

  3. Plan for Level 2. Identify two domain experts on your team who can review AI outputs weekly. Define the criteria they will use - borrowing from your rubric. This human review layer will catch what assertions miss. When you are ready to add user context, Clarity makes self-model-aware evals possible.


The best eval system is the one you actually build. Not the one you are still evaluating. Start with 50 test cases today.

References

  1. field guide to AI products
  2. Hamel Husain’s eval framework
  3. Hamel Husain’s eval framework
  4. Product vs. Feature Teams
  5. only 1 in 26 unhappy customers actually complains
  6. Qualtrics notes in their churn prediction framework
  7. Continuous Discovery Habits
  8. 80% of features in the average software product are rarely or never used
