
The Eval Stack for AI Product Teams

Unit tests, human review, A/B tests. The 3-level eval pyramid from Hamel Husain's framework, plus where self-models fit as the context layer that makes every level smarter.

Robert Ta's Self-Model · CEO & Co-Founder · 7 min read

TL;DR

  • The eval stack mirrors the software testing pyramid (unit-level assertions at the base, human review in the middle, system-level A/B tests and metrics at the top), yet most AI teams build it upside down
  • Hamel Husain’s 3-level eval framework provides the skeleton, but the stack becomes truly powerful when a context layer connects each level to user understanding
  • Self-models are the missing infrastructure layer that makes assertions user-aware, human review targeted, and system metrics personalized

The eval stack for AI product teams mirrors the software testing pyramid: automated assertions at the base, structured human review in the middle, and system-level A/B tests at the top. Most AI teams build this pyramid upside down, with dashboards full of aggregate metrics and no unit-level quality gates to catch regressions before they reach users. This post covers how to build each level of the eval pyramid based on Hamel Husain’s framework, and where self-models fit as the context layer that makes every level user-aware.


Level 1: Assertions (The Foundation)

Level 1 is the cheapest, fastest, and most important layer. These are automated checks that run on every deployment, every prompt change, every model update. They catch regressions in seconds.

Behavioral assertions verify that the AI follows rules. Does it refuse out-of-scope requests? Does it include required disclaimers? Does it avoid generating harmful content? These are binary pass/fail checks that protect your product from catastrophic failures.

Quality assertions verify that outputs meet minimum standards. Does the response stay under the token limit? Does it reference the correct knowledge base? Does it maintain consistent terminology? These catch the subtle regressions that degrade quality over time.

Consistency assertions verify that the AI behaves predictably. Given the same input, does it produce similar-quality outputs? Does it maintain its persona across interactions? Does it handle paraphrased versions of the same question consistently?

As Hamel Husain emphasizes [1], these assertions are not comprehensive quality checks. They are safety nets. They catch the obvious failures so that your more expensive evaluation layers can focus on the nuanced ones.

Behavioral Assertions

Binary pass/fail checks. Does the AI refuse out-of-scope requests? Include required disclaimers? Avoid harmful content? Catches catastrophic failures.

Quality Assertions

Minimum standard checks. Does the response stay under the token limit? Reference the correct knowledge base? Maintain consistent terminology? Catches subtle regressions.

Consistency Assertions

Predictability checks. Same input produces similar-quality outputs? Persona maintained across interactions? Paraphrased questions handled consistently?
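
To make this concrete, here is a minimal sketch of what a Level 1 assertion suite could look like. The AIResponse shape, the specific checks, and the runAssertions helper are illustrative assumptions, not part of any particular framework; consistency assertions are omitted because they require comparing multiple runs of the same prompt rather than inspecting a single response.

level1-assertions.ts
// Minimal sketch of a Level 1 assertion suite (all names are illustrative).
interface AIResponse {
  text: string;
  tokenCount: number;
  sources: string[];
}

interface Assertion {
  name: string;
  check: (r: AIResponse) => boolean;
}

const assertions: Assertion[] = [
  // Behavioral: binary pass/fail rules that guard against catastrophic failures
  { name: 'includes-required-disclaimer',
    check: (r) => r.text.includes('not professional advice') },
  { name: 'refuses-out-of-scope-requests',
    check: (r) => !/here is your diagnosis/i.test(r.text) },
  // Quality: minimum standards that catch slow regressions
  { name: 'under-token-limit',
    check: (r) => r.tokenCount <= 1024 },
  { name: 'cites-only-known-knowledge-base',
    check: (r) => r.sources.every((s) => s.startsWith('kb://')) },
];

export function runAssertions(response: AIResponse) {
  const failures = assertions.filter((a) => !a.check(response)).map((a) => a.name);
  return { passed: failures.length === 0, failures };
}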

Level 2: Human Review (The Middle Layer)

Level 2 is where domain experts evaluate a sample of AI outputs against your quality rubric. This catches what assertions miss: outputs that are technically correct but unhelpful, appropriate but tone-deaf, accurate but confusing.

The key to effective human review is structure. Without a rubric, reviewers grade against their own private standards. With a rubric, they grade against shared criteria.

Sampling strategy matters. Do not review a random sample. Review a stratified sample: some high-confidence outputs (to verify the assertions are catching what they should), some low-confidence outputs (to find quality gaps), some edge cases (to understand boundary behavior), and some user-flagged interactions (to calibrate against real user perception).
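
A sketch of what that stratified sampler might look like; the Interaction fields (confidence, isEdgeCase, userFlagged) are assumptions for illustration, not a prescribed schema:

review-sampling.ts
// Hypothetical sketch: build a stratified review sample instead of a random one.
interface Interaction {
  id: string;
  confidence: number;    // assertion-layer confidence in the output
  isEdgeCase: boolean;
  userFlagged: boolean;
}

function shuffle<T>(xs: T[]): T[] {
  // Fisher-Yates shuffle (unbiased, unlike sorting by Math.random())
  const a = [...xs];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

export function stratifiedSample(pool: Interaction[], perStratum: number): Interaction[] {
  const strata: Array<(i: Interaction) => boolean> = [
    (i) => i.confidence >= 0.9, // verify assertions catch what they should
    (i) => i.confidence < 0.6,  // find quality gaps
    (i) => i.isEdgeCase,        // understand boundary behavior
    (i) => i.userFlagged,       // calibrate against real user perception
  ];
  return strata.flatMap((match) => shuffle(pool.filter(match)).slice(0, perStratum));
}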

Reviewer calibration matters more. As Hamel Husain notes, if your reviewers disagree with each other 30% of the time, your evaluation system has a 30% noise floor. Start every review cycle with a calibration session where reviewers grade the same 10 interactions and discuss disagreements.
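
That noise floor is easy to measure. A small sketch, assuming each calibration interaction gets a pass/fail verdict from two reviewers:

reviewer-agreement.ts
// Sketch: raw inter-reviewer agreement on shared calibration interactions.
// verdicts maps interactionId -> [reviewerA, reviewerB] pass/fail grades.
export function agreementRate(verdicts: Map<string, [boolean, boolean]>): number {
  let agree = 0;
  for (const [, [a, b]] of verdicts) {
    if (a === b) agree++;
  }
  return agree / verdicts.size;
}
// agreementRate(...) === 0.7 means a ~30% noise floor: nearly a third of your
// "quality signal" is reviewer disagreement, not model behavior.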

The outputs of Level 2 feed back into Level 1. Every pattern the human reviewers identify (a new failure mode, a quality regression, an edge case) becomes a new assertion.

Inverted Eval Pyramid

  • Lots of dashboards and system metrics
  • Occasional ad-hoc human review
  • Almost no automated assertions
  • Regressions discovered by users
  • Weeks of latency between cause and detection

Proper Eval Pyramid

  • 200+ automated assertions at Level 1
  • Weekly structured human review at Level 2
  • System metrics informed by lower levels
  • Regressions caught before deployment
  • Minutes between cause and detection

Level 3: System Metrics (The Capstone)

Level 3 is where you measure quality at the system level. This is not about individual interactions; it is about trends, distributions, and aggregate quality across your user base.

Quality distribution tracking shows you how the AI performs across all users, not just the average. If your mean quality score is 0.82 but the bottom 10% of users experience a score of 0.55, you have a personalization problem that the average obscures.
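
A sketch of distribution tracking, assuming quality scores normalized to [0, 1]:

quality-distribution.ts
// Sketch: the mean hides the tail, so track the bottom decile explicitly.
export function qualityDistribution(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  const p10 = sorted[Math.floor(sorted.length * 0.1)]; // bottom-decile score
  return { mean, p10 };
}
// { mean: 0.82, p10: 0.55 } is the personalization problem the average obscures.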

Dimension trend analysis tracks each quality dimension (relevance, coherence, depth, fit) over time. This reveals whether model updates, prompt changes, or infrastructure modifications improve one dimension at the expense of another.

A/B testing compares different approaches with statistical rigor. When you change the retrieval pipeline, does quality actually improve? When you add few-shot examples to the prompt, does coherence increase? Level 3 provides the causal evidence that Levels 1 and 2 cannot.
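
As a sketch of the statistical core, Welch's t-test compares mean quality between a control and treatment group; a real pipeline would also convert t to a p-value and pre-register sample sizes:

ab-test.ts
// Sketch: Welch's t-statistic for comparing mean quality across two variants.
const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
const variance = (xs: number[]) => {
  const m = mean(xs);
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
};

export function welchT(control: number[], treatment: number[]): number {
  const se = Math.sqrt(
    variance(control) / control.length + variance(treatment) / treatment.length
  );
  return (mean(treatment) - mean(control)) / se;
}
// With large samples, |t| > 1.96 suggests the quality change is unlikely to be noise.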

The Context Layer: Where Self-Models Fit

Here is the insight that took me two years to fully internalize. Every level of the eval stack has the same blind spot: it evaluates outputs without knowing who the output was for.

Level 1 without context checks whether the response is generally correct. Level 1 with context checks whether it is correct for this user. A Level 1 assertion that says the response should be accessible to a non-technical reader fails for an expert who wants technical depth.

Level 2 without context has reviewers grading outputs generically. Level 2 with context provides the reviewer with the user’s expertise level, goals, and history, so they can evaluate whether the AI served the person, not just the question.

Level 3 without context tracks aggregate quality across all users. Level 3 with context segments quality by user type, reveals which user populations are underserved, and identifies the personalization gaps that drive churn.

Level 1 + Context: User-Aware Assertions

Checks whether the response is correct for this user, not just generally correct. An assertion that enforces “accessible to a non-technical reader” adapts to actually check for the right depth per user.

Level 2 + Context: Targeted Human Review

Reviewers see the user’s expertise level, goals, and history. They evaluate whether the AI served the person, not just the question. Review becomes calibrated to real user needs.

Level 3 + Context: Segmented System Metrics

Quality segmented by user type reveals which populations are underserved. Identifies the personalization gaps that drive churn. Makes invisible user experiences visible.

Self-models are the context layer that plugs into every level of the pyramid. They do not replace the eval stack; they make every layer smarter.

eval-stack.ts
// The eval stack with context layer: self-models make every level smarter

// Level 1: Context-aware assertions
const level1 = await clarity.runAssertions(response, {
  userContext: selfModel, // Fit checks need the user
  rubric: assertionSuite,
});

// Level 2: Targeted human review (reviewers see the user context too)
const level2 = await clarity.queueForReview(response, {
  userContext: selfModel,
  priority: level1.fitScore < 0.7 ? 'high' : 'normal',
});

// Level 3: Segmented system metrics (quality tracked per user segment)
const level3 = await clarity.trackQuality(response, {
  userSegment: selfModel.expertiseLevel,
  dimensions: ['relevance', 'coherence', 'depth', 'fit'],
});

Building the Stack Incrementally

You do not need to build all three levels simultaneously. Here is the order that produces the most value with the least investment:

Month 1: Build Level 1. Write 50 assertions based on your most common failure modes. Hook them into your CI/CD pipeline. Every deployment now runs a quality gate. This alone will prevent more regressions than any other investment.
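
As a sketch, the CI gate can be very small. This reuses the hypothetical runAssertions suite from earlier and a recorded fixture set; both are illustrative, not a specific framework's API:

ci-quality-gate.ts
// Hypothetical CI gate: run Level 1 assertions against recorded fixtures
// and fail the build (non-zero exit) on any regression.
import { runAssertions } from './level1-assertions';
import { fixtures } from './eval-fixtures'; // recorded { id, response } pairs

let failed = 0;
for (const fixture of fixtures) {
  const result = runAssertions(fixture.response);
  if (!result.passed) {
    console.error(`FAIL ${fixture.id}: ${result.failures.join(', ')}`);
    failed++;
  }
}
console.log(`${fixtures.length - failed}/${fixtures.length} fixtures passed`);
if (failed > 0) process.exit(1); // block the deployment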

Month 2: Start Level 2. Establish a weekly human review cadence. Two domain experts review 25 interactions against a shared rubric. Feed findings back into Level 1 as new assertions. This creates a flywheel where human insight continuously strengthens automated checks.

Month 3: Add the Context Layer. Integrate self-models so that Level 1 assertions and Level 2 reviews can account for who the user is. This is when your eval stack starts measuring personalization quality, the dimension that most directly predicts retention.

Month 4+: Build Level 3. With the lower levels producing reliable signals, start tracking aggregate quality metrics, running A/B tests, and building the dashboards that executives need.

Month 1: Build Level 1

Write 50 assertions based on your most common failure modes. Hook them into CI/CD. Every deployment now runs a quality gate. This alone prevents more regressions than any other investment.

Month 2: Start Level 2

Weekly human review cadence. Two domain experts review 25 interactions against a shared rubric. Feed findings back into Level 1 as new assertions. Creates a flywheel.

Month 3: Add the Context Layer

Integrate self-models so Level 1 and Level 2 account for who the user is. This is when your eval stack starts measuring personalization quality, the dimension that predicts retention.

Month 4+: Build Level 3

Aggregate quality metrics, A/B tests, and executive dashboards. With reliable lower-level signals, system metrics become actionable rather than decorative.

Eval Level               | Speed        | Cost   | What It Catches              | When to Build
Level 1 (Assertions)     | Seconds      | Low    | Regressions, rule violations | Month 1
Level 2 (Human Review)   | Days         | Medium | Nuanced quality gaps         | Month 2
Context Layer            | Milliseconds | Medium | Personalization failures     | Month 3
Level 3 (System Metrics) | Weeks        | High   | Trends and distributions     | Month 4+

Trade-offs

Building the pyramid bottom-up feels slow. Teams want dashboards immediately. But dashboards without assertions are thermometers without a diagnosis: they tell you something is wrong without helping you find it.

Assertions require maintenance. As your product evolves, assertions become stale or irrelevant. Budget 10% of your eval time for assertion review and pruning.

Human review does not scale linearly. As your product grows, you need more reviewers, or you need LLM-as-judge to augment human review. Plan for this transition when your interaction volume exceeds what 2-3 reviewers can handle.
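
One way to plan that transition, sketched below with a hypothetical judgeWithLLM call standing in for whatever model API you use: route only low-confidence judgments to humans and let the LLM judge auto-grade the rest.

llm-judge-routing.ts
// Hypothetical sketch: an LLM judge for volume, humans for ambiguity.
// judgeWithLLM is a stand-in for your model API call, not a real library.
interface Judgment {
  pass: boolean;
  confidence: number;
}

declare function judgeWithLLM(output: string, rubric: string): Promise<Judgment>;

export async function route(output: string, rubric: string): Promise<'auto' | 'human'> {
  const judgment = await judgeWithLLM(output, rubric);
  // Confident judgments are auto-graded; uncertain ones go to human review.
  return judgment.confidence >= 0.8 ? 'auto' : 'human';
}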

The context layer adds infrastructure complexity. Self-models require storage, retrieval, and privacy considerations. The investment is worth it for personalization quality, but it is not free.

What to Do Next

  1. Audit your current eval stack. Map what you have against the three levels. Most teams will find an inverted pyramid: lots of system metrics, little automation. Identify the biggest gap and start there.

  2. Write 50 Level 1 assertions this week. Use Hamel Husain’s eval framework [2] as your guide. Focus on behavioral assertions first (safety), then quality assertions (standards), then consistency assertions (predictability).

  3. Plan for the context layer. Even if you cannot implement self-models this month, design your eval stack with a user-context slot at every level. When you are ready to add personalization quality measurement, the architecture will be waiting. Clarity provides the self-model API that fills that slot.


Software engineering built the testing pyramid twenty years ago. AI products need the same discipline. Build the base first.

References

  1. Hamel Husain, on evals as safety nets
  2. Hamel Husain's eval framework
  3. Twilio Segment, 2024 State of Personalization Report
  4. Thoughtworks, strategic framework for evaluating third-party solutions
  5. Reelgood and Learndipity Data Insights, 2016 survey of 2,000 Americans
  6. Product vs. Feature Teams
  7. "Only 1 in 26 unhappy customers actually complains"
