Measuring AI System Reliability: The Metrics Every Vendor Should Report
MTBF, MTTR, failure rate, availability, and SLOs for AI systems. What to demand from vendors and how to measure it yourself.
TL;DR
- Traditional reliability metrics (MTBF, MTTR, availability) apply to AI systems but require adaptation — an AI system can be “up” while producing degraded outputs, which binary uptime metrics miss
- AI systems need additional reliability metrics: output quality consistency, model drift rate, hallucination rate, and latency distribution stability
- Demand these metrics from vendors before procurement — if a vendor cannot report them, their system is not production-ready for enterprise use
Enterprise reliability engineering has decades of established practice: MTBF, MTTR, availability percentages, error budgets, SLOs. These metrics work because traditional software has a clear binary state — it either returns the correct result or it throws an error.
AI systems break this assumption. An LLM-powered system can be “up” (responding to requests, returning 200 status codes) while producing degraded outputs — hallucinations, off-topic responses, or answers that are technically correct but miss what the user needed. Binary reliability metrics cannot capture this gradual degradation. Enterprise AI needs an expanded reliability framework that covers both infrastructure reliability (is the system responding?) and output reliability (are the responses trustworthy?).
Infrastructure Reliability Metrics
These are the traditional SRE metrics adapted for AI systems. They measure whether the system is operational — responding to requests within acceptable time bounds.
Availability
Availability measures the percentage of time the system is operational. The standard formula: uptime / (uptime + downtime) * 100. For AI systems, define “operational” precisely: the system accepts requests, returns responses within the latency SLO, and does not return error codes.
Common availability tiers:
- 99.9% (three nines): 8.8 hours of downtime per year. Acceptable for most internal tools and non-critical workflows.
- 99.95%: 4.4 hours per year. Standard for customer-facing SaaS products.
- 99.99% (four nines): 52.6 minutes per year. Required for mission-critical systems. Expensive to achieve with LLM providers as dependencies.
The challenge for AI systems: LLM provider outages are outside your control. If your system depends on OpenAI’s API and their availability is 99.9%, your system’s availability cannot exceed 99.9% without a fallback provider. Multi-provider architectures (primary + fallback model) are required for availability above three nines.
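A quick sketch of the compound-availability arithmetic, assuming provider failures are independent and failover is instantaneous (both optimistic simplifications):

```python
# Compound availability: serial dependencies multiply, fallbacks cover each other.
# Assumes independent failures and instant failover -- real numbers will be lower.

def serial_availability(*components: float) -> float:
    """System depends on every component: availability is the product."""
    result = 1.0
    for a in components:
        result *= a
    return result

def with_fallback(primary: float, fallback: float) -> float:
    """System is down only when both providers are down at the same time."""
    return 1.0 - (1.0 - primary) * (1.0 - fallback)

# A single provider at 99.9% caps you at 99.9% before your own infra is counted:
print(serial_availability(0.9999, 0.999))  # ~0.9989
# Two independent 99.9% providers behind a failover layer:
print(with_fallback(0.999, 0.999))         # ~0.999999
```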
Mean Time Between Failures (MTBF)
MTBF measures the average time between system failures. For AI systems, define “failure” to include both infrastructure failures (system down, API timeout, error response) and severe quality failures (hallucination rate exceeding threshold, latency exceeding SLO).
Calculate MTBF over a rolling window (30 days is standard) and track the trend. A declining MTBF indicates the system is becoming less reliable — either infrastructure is degrading or model quality is drifting.
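A minimal sketch of the rolling-window calculation, assuming you log failure timestamps (infrastructure and quality failures alike) somewhere queryable:

```python
# Sketch: MTBF over a rolling 30-day window from failure timestamps.
# The data source is hypothetical; "failure" includes severe quality failures.
from datetime import datetime, timedelta

def rolling_mtbf(failure_times: list[datetime], window_days: int = 30) -> float | None:
    """Mean hours between failures inside the window, or None if fewer than two."""
    cutoff = datetime.utcnow() - timedelta(days=window_days)
    recent = sorted(t for t in failure_times if t >= cutoff)
    if len(recent) < 2:
        return None  # not enough failures in the window to compute a gap
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(recent, recent[1:])]
    return sum(gaps) / len(gaps)
```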
Mean Time to Recovery (MTTR)
MTTR measures how long it takes to restore the system after a failure. For AI systems, MTTR has two components:
- Infrastructure MTTR: Time to restore service after an outage. Dominated by detection time (how quickly you notice the failure) and provider recovery time (how quickly the LLM provider restores service).
- Quality MTTR: Time to restore output quality after a quality degradation. This includes detection (how quickly you identify that outputs have degraded), diagnosis (identifying whether the issue is the model, the retrieval pipeline, or the data), and remediation (prompt adjustment, model rollback, or data correction).
Quality MTTR is typically much longer than infrastructure MTTR because quality degradation is harder to detect and harder to fix. A model that starts hallucinating at a slightly higher rate may not trigger alerts for hours or days.
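One way to keep the two MTTR figures separate is to tag every incident with its class when it is opened; a minimal sketch, with a hypothetical incident record:

```python
# Sketch: track MTTR separately for infrastructure and quality incidents.
# The Incident record is illustrative, not a real schema. Measuring from
# started_at (when the failure began) rather than when it was noticed keeps
# detection time inside MTTR, which is where quality incidents lose hours.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Incident:
    kind: str              # "infrastructure" or "quality"
    started_at: datetime   # when the failure began
    resolved_at: datetime  # when service or quality was restored

def mttr_hours(incidents: list[Incident], kind: str) -> float | None:
    durations = [
        (i.resolved_at - i.started_at).total_seconds() / 3600
        for i in incidents if i.kind == kind
    ]
    return sum(durations) / len(durations) if durations else None
```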
| Traditional SRE Metrics | AI System Reliability Metrics |
| --- | --- |
| Binary state: up or down | Gradient state: fully operational to fully degraded |
| Availability = uptime percentage | Availability = uptime AND output quality above threshold |
| MTBF counts infrastructure failures | MTBF counts infrastructure AND quality failures |
| MTTR measures time to restore service | MTTR includes quality restoration, not just service restoration |
| Error budget = acceptable downtime | Error budget includes quality degradation budget |
| Latency measured at p50, p95, p99 | Latency PLUS quality score distribution over time |
Output Reliability Metrics
These metrics capture the quality dimension that traditional SRE misses. They measure whether the system’s outputs are trustworthy, consistent, and aligned with user expectations.
Hallucination Rate
The percentage of responses that contain claims not supported by the provided context (for RAG systems) or claims that contradict known facts (for general systems). Measurement requires either human evaluation on a sample or automated claim verification against a ground truth corpus.
What to demand from vendors: Hallucination rate measured on your domain content, not on generic benchmarks. A vendor claiming “less than 2% hallucination rate” should be able to tell you: measured on what data, with what detection method, over what time period. If they cannot answer these questions, the number is marketing, not measurement.
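If you measure it yourself via sampled human review, report the sample size and the uncertainty alongside the rate; a minimal sketch (the 200-response sample is an assumption, not a standard):

```python
# Sketch: hallucination rate from a human-review sample, with a 95% margin
# of error so a "2%" claim carries its uncertainty.
import math

def hallucination_rate(labels: list[bool]) -> tuple[float, float]:
    """labels[i] is True if response i contained an unsupported claim.
    Returns (rate, 95% margin of error) using a normal approximation."""
    n = len(labels)
    rate = sum(labels) / n
    margin = 1.96 * math.sqrt(rate * (1 - rate) / n)
    return rate, margin

# 200 sampled responses, 5 flagged: 2.5% +/- 2.2%. With small samples the
# interval matters as much as the point estimate.
print(hallucination_rate([True] * 5 + [False] * 195))
```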
Output Quality Consistency
Quality variance matters as much as average quality. A system that produces 90% quality on average but swings between 60% and 100% depending on the query is less reliable than a system that consistently produces 85% quality. Users cannot trust a system whose quality is unpredictable.
Measure quality consistency as the standard deviation of quality scores over a rolling window. Track it alongside average quality. A stable average with increasing variance is an early warning signal of reliability degradation.
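A minimal sketch of that rolling calculation, using a count-based window for simplicity (a time-based 7-day window works the same way):

```python
# Sketch: quality consistency as a rolling standard deviation of per-response
# quality scores. The score source (LLM-as-judge, heuristic) is up to you.
import statistics

def rolling_consistency(scores: list[float], window: int = 500) -> list[float]:
    """Standard deviation over a sliding window of the most recent scores."""
    out = []
    for i in range(window, len(scores) + 1):
        out.append(statistics.pstdev(scores[i - window:i]))
    return out

# Alert when the latest window's stddev exceeds your consistency SLO
# (e.g. 0.08 on a 0-1 quality scale), even if the average still looks healthy.
```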
Model Drift Rate
Model drift measures how much the model’s behavior changes over time without intentional updates. For API-based models (OpenAI, Anthropic, Google), the provider may update the model behind the same API endpoint. These silent updates can change your system’s behavior in ways your evaluation suite does not cover.
Chen et al. (2023) in “How Is ChatGPT’s Behavior Changing over Time?” documented measurable performance shifts in GPT-3.5 and GPT-4 across a three-month period on the same evaluation tasks. Response formatting, reasoning accuracy, and instruction following all showed statistically significant changes.
Detection: Run a fixed evaluation suite (your “canary tests”) on a daily or weekly cadence. Track scores over time. Statistically significant score changes without intentional system updates indicate model drift. These canary tests should cover your most important use cases and your most fragile edge cases.
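A sketch of the comparison step, using a two-sample t-test against a frozen baseline; the significance threshold and the 5% minimum delta are illustrative choices, not a standard:

```python
# Sketch: flag drift when canary scores shift both significantly and by a
# margin large enough to matter (guards against alerting on noise).
from scipy import stats

def detect_drift(baseline_scores: list[float],
                 today_scores: list[float],
                 alpha: float = 0.01,
                 min_delta: float = 0.05) -> bool:
    t_stat, p_value = stats.ttest_ind(baseline_scores, today_scores)
    delta = abs(sum(today_scores) / len(today_scores)
                - sum(baseline_scores) / len(baseline_scores))
    return p_value < alpha and delta > min_delta
```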
Latency Distribution Stability
Latency spikes in AI systems often correlate with quality changes. When an LLM provider is under load, it may route requests to different hardware or apply different batching strategies, affecting both latency and output quality. A system that suddenly shifts from a p99 latency of 2 seconds to a p99 of 8 seconds is signaling a change in how requests are being processed.
Track latency distributions (not just averages) and alert on distribution shifts. A bimodal latency distribution (some requests fast, some slow) often indicates inconsistent backend routing that may affect quality.
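One way to alert on distribution shifts rather than averages is a two-sample Kolmogorov-Smirnov test against a reference window; a sketch, with the significance threshold as an assumption:

```python
# Sketch: compare today's latency sample against last week's. The KS test
# catches distribution shifts (including bimodality) that a mean or even a
# single percentile would miss.
from scipy import stats

def latency_shifted(reference_ms: list[float],
                    current_ms: list[float],
                    alpha: float = 0.01) -> bool:
    statistic, p_value = stats.ks_2samp(reference_ms, current_ms)
    return p_value < alpha
```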
Defining SLOs for AI Systems
Service Level Objectives for AI systems must cover both infrastructure and output quality. A system that meets its uptime SLO but fails its quality SLO is not meeting its obligations.
Infrastructure SLOs
Standard SRE practice applies:
- Availability: 99.9% of requests return a non-error response within the latency SLO.
- Latency: p50 < 1s, p95 < 3s, p99 < 5s (adjust based on your use case).
- Error rate: < 0.1% of requests return 5xx errors.
Quality SLOs
These are specific to AI systems:
- Hallucination rate: < X% of responses contain unsupported claims (measured by automated verification or sampled human review).
- Relevance rate: > Y% of responses are rated “relevant” or “helpful” by users or automated evaluation.
- Consistency: Quality score standard deviation < Z over a 7-day rolling window.
- Drift tolerance: Canary test scores do not deviate more than W% from baseline without an intentional model change.
```yaml
# AI System SLO Definition
# SLOs must cover both infrastructure and output quality
service: customer-support-agent
version: 2.1

infrastructure_slos:
  availability:
    target: 99.95%
    window: 30d
    measurement: successful_responses / total_requests
  latency:
    p50: 800ms
    p95: 2500ms
    p99: 5000ms  # Latency SLOs must account for LLM provider variance
  error_rate:
    target: < 0.1%
    measurement: 5xx_responses / total_requests

# Quality SLOs are what separate AI reliability from traditional SRE
quality_slos:
  hallucination_rate:
    target: < 3%
    measurement: sampled_human_review (n=200/week)
    detection_latency: < 24h
  relevance_rate:
    target: > 85%
    measurement: automated_eval + user_feedback
  consistency:
    target: quality_stddev < 0.08
    window: 7d
  drift_tolerance:
    target: canary_score_delta < 5%
    measurement: daily_canary_suite (n=50 cases)
```
Error Budgets for AI Quality
The error budget concept from SRE applies directly to AI quality SLOs. If your hallucination rate SLO is < 3%, and your measured hallucination rate over the last 30 days is 2.1%, you have consumed 70% of your error budget. This gives you a quantitative basis for decisions: ship a new feature that might increase hallucination risk? Check the error budget first.
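The arithmetic is the same as for availability error budgets; a minimal sketch using the numbers above (the 80% deployment-gate threshold is an assumption, not a rule):

```python
# Sketch: quality error-budget consumption and a simple deployment gate.
def budget_consumed(measured_rate: float, slo_target: float) -> float:
    """Fraction of the quality error budget used (1.0 = exhausted)."""
    return measured_rate / slo_target

def can_ship_risky_change(measured_rate: float, slo_target: float,
                          threshold: float = 0.8) -> bool:
    """Block changes likely to increase hallucination risk once most of the
    budget is gone."""
    return budget_consumed(measured_rate, slo_target) < threshold

print(budget_consumed(0.021, 0.03))        # 0.70 -> 70% of the budget consumed
print(can_ship_risky_change(0.021, 0.03))  # True, but with little headroom
```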
When the quality error budget is exhausted:
- Freeze deployments: No model changes, prompt changes, or retrieval pipeline changes until the quality budget recovers.
- Increase monitoring: Sample more production traffic for quality evaluation. Reduce the detection latency for quality issues.
- Root cause analysis: Identify what consumed the error budget — was it a model provider change, a data quality issue, or a prompt regression?
- Remediation: Fix the identified issue and verify that quality metrics recover before resuming normal operations.
The Vendor Evaluation Checklist
Before signing an AI vendor contract, demand answers to these questions:
Infrastructure Questions
What is your historical availability (last 12 months)? What is your MTTR for infrastructure outages? Do you have a status page with incident history? What is your latency distribution (p50/p95/p99)? Do you support multi-region failover?
Quality Questions
How do you measure hallucination rate? What is your model update policy — do you notify before changes? Can I pin a specific model version? Do you provide canary testing infrastructure? What quality SLOs do you offer contractually?
Red flags in vendor responses:
- “We have 99.99% uptime” with no published incident history or status page. Four nines without evidence is marketing.
- “Our hallucination rate is less than 1%” without specifying the measurement methodology, dataset, or time period.
- No model versioning or pinning. If the vendor can change the model behind your API endpoint without notice, your reliability is at their discretion.
- No quality SLOs in the contract. If quality is not in the SLA, the vendor has no contractual obligation to maintain it.
- No drift monitoring or canary testing infrastructure. If the vendor does not monitor for drift, they will not detect quality degradation until you report it.
Building Your Own Monitoring
If your vendor does not provide quality monitoring, build it yourself. The minimum viable monitoring stack:
- Canary test suite: 50-100 test cases covering your most important use cases. Run daily. Alert on score changes > 5%.
- Production sampling: Evaluate 1-5% of production traffic with an automated quality scorer (LLM-as-judge or heuristic). Track scores over time (a minimal sampling sketch follows this list).
- Latency monitoring: Track latency distribution (not just average). Alert on distribution shifts.
- User feedback loop: Structured feedback collection (thumbs up/down, relevance rating). Correlate with automated quality scores to calibrate your automated evaluation.
- Incident classification: When issues occur, classify them as infrastructure (system down) or quality (system up but degraded). Track MTTR separately for each class.
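For the production-sampling piece, a minimal LLM-as-judge sketch; the sampling rate, judge prompt, and `judge` callable are all assumptions to adapt to your own stack:

```python
# Sketch: score a small fraction of live traffic with an LLM-as-judge prompt.
# Calibrate the judge against your user-feedback signal before trusting it.
import random

SAMPLE_RATE = 0.02  # evaluate roughly 2% of production traffic

def maybe_score(request: str, response: str, judge) -> float | None:
    """Return a 0-1 quality score for sampled responses, None otherwise.
    `judge` is any callable wrapping your evaluation model."""
    if random.random() > SAMPLE_RATE:
        return None
    prompt = (
        "Rate the following answer for relevance and factual grounding "
        "on a scale of 0 to 1. Reply with only the number.\n\n"
        f"Question: {request}\nAnswer: {response}"
    )
    return float(judge(prompt))
```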
Where Clarity Fits
Clarity’s self-model API adds a user-level reliability dimension. Instead of measuring quality in aggregate, track reliability per user: does the system maintain consistent quality for each individual user over time? User-level reliability metrics catch degradation patterns that aggregate metrics hide — a system might maintain 90% average quality while specific user segments experience 70% quality. The self-model enables per-user SLOs that hold the system accountable at the individual level.
Key Takeaways
- AI systems require reliability metrics beyond traditional SRE — binary uptime misses quality degradation, which is where most enterprise AI failures occur
- Five output reliability metrics are essential: hallucination rate, output quality consistency, model drift rate, latency distribution stability, and relevance rate
- SLOs for AI systems must cover both infrastructure (availability, latency, error rate) and quality (hallucination rate, consistency, drift tolerance)
- Error budgets for quality work the same way as error budgets for availability — freeze deployments when the budget is exhausted
- Demand these metrics from vendors before procurement — if they cannot report them, their system is not ready for enterprise use