
Measuring AI System Reliability: The Metrics Every Vendor Should Report

MTBF, MTTR, failure rate, availability, and SLOs for AI systems. What to demand from vendors and how to measure it yourself.

Robert Ta's Self-Model
CEO & Co-Founder · 6 min read

TL;DR

  • Traditional reliability metrics (MTBF, MTTR, availability) apply to AI systems but require adaptation — an AI system can be “up” while producing degraded outputs, which binary uptime metrics miss
  • AI systems need additional reliability metrics: output quality consistency, model drift rate, hallucination rate, and latency distribution stability
  • Demand these metrics from vendors before procurement — if a vendor cannot report them, their system is not production-ready for enterprise use

Enterprise reliability engineering has decades of established practice: MTBF, MTTR, availability percentages, error budgets, SLOs. These metrics work because traditional software has a clear binary state — it either returns the correct result or it throws an error.

AI systems break this assumption. An LLM-powered system can be “up” (responding to requests, returning 200 status codes) while producing degraded outputs — hallucinations, off-topic responses, or answers that are technically correct but miss what the user needed. Binary reliability metrics cannot capture this gradual degradation. Enterprise AI needs an expanded reliability framework that covers both infrastructure reliability (is the system responding?) and output reliability (are the responses trustworthy?).

99.9%
typical SaaS uptime SLO
8.7h
annual downtime at 99.9%
0
of those hours account for quality degradation
4
additional metrics AI systems need beyond uptime

Infrastructure Reliability Metrics

These are the traditional SRE metrics adapted for AI systems. They measure whether the system is operational — responding to requests within acceptable time bounds.

Availability

Availability measures the percentage of time the system is operational. The standard formula: uptime / (uptime + downtime) * 100. For AI systems, define “operational” precisely: the system accepts requests, returns responses within the latency SLO, and does not return error codes.
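As a sanity check, the downtime allowances for the tiers listed next fall straight out of this formula. A quick sketch, using a 365-day year:

```python
def annual_downtime_hours(availability_pct: float) -> float:
    """Hours of allowed downtime per year at a given availability target."""
    hours_per_year = 365 * 24  # 8760
    return hours_per_year * (1 - availability_pct / 100)

print(f"{annual_downtime_hours(99.9):.2f} h")          # ≈ 8.76 h per year
print(f"{annual_downtime_hours(99.95):.2f} h")         # ≈ 4.38 h per year
print(f"{annual_downtime_hours(99.99) * 60:.1f} min")  # ≈ 52.6 min per year
```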

Common availability tiers:

  • 99.9% (three nines): 8.7 hours of downtime per year. Acceptable for most internal tools and non-critical workflows.
  • 99.95%: 4.4 hours per year. Standard for customer-facing SaaS products.
  • 99.99% (four nines): 52.6 minutes per year. Required for mission-critical systems. Expensive to achieve with LLM providers as dependencies.

The challenge for AI systems: LLM provider outages are outside your control. If your system depends on OpenAI’s API and their availability is 99.9%, your system’s availability cannot exceed 99.9% without a fallback provider. Multi-provider architectures (primary + fallback model) are required for availability above three nines.

Mean Time Between Failures (MTBF)

MTBF measures the average time between system failures. For AI systems, define “failure” to include both infrastructure failures (system down, API timeout, error response) and severe quality failures (hallucination rate exceeding threshold, latency exceeding SLO).

Calculate MTBF over a rolling window (30 days is standard) and track the trend. A declining MTBF indicates the system is becoming less reliable — either infrastructure is degrading or model quality is drifting.

Mean Time to Recovery (MTTR)

MTTR measures how long it takes to restore the system after a failure. For AI systems, MTTR has two components:

  • Infrastructure MTTR: Time to restore service after an outage. Dominated by detection time (how quickly you notice the failure) and provider recovery time (how quickly the LLM provider restores service).
  • Quality MTTR: Time to restore output quality after a quality degradation. This includes detection (how quickly you identify that outputs have degraded), diagnosis (identifying whether the issue is the model, the retrieval pipeline, or the data), and remediation (prompt adjustment, model rollback, or data correction).

Quality MTTR is typically much longer than infrastructure MTTR because quality degradation is harder to detect and harder to fix. A model that starts hallucinating at a slightly higher rate may not trigger alerts for hours or days.

Traditional SRE Metrics

  • Binary state: up or down
  • Availability = uptime percentage
  • MTBF counts infrastructure failures
  • MTTR measures time to restore service
  • Error budget = acceptable downtime
  • Latency measured at p50, p95, p99

AI System Reliability Metrics

  • Gradient state: fully operational to fully degraded
  • Availability = uptime AND output quality above threshold
  • MTBF counts infrastructure AND quality failures
  • MTTR includes quality restoration, not just service restoration
  • Error budget includes quality degradation budget
  • Latency PLUS quality score distribution over time

Output Reliability Metrics

These metrics capture the quality dimension that traditional SRE misses. They measure whether the system’s outputs are trustworthy, consistent, and aligned with user expectations.

Hallucination Rate

The percentage of responses that contain claims not supported by the provided context (for RAG systems) or claims that contradict known facts (for general systems). Measurement requires either human evaluation on a sample or automated claim verification against a ground truth corpus.

What to demand from vendors: Hallucination rate measured on your domain content, not on generic benchmarks. A vendor claiming “less than 2% hallucination rate” should be able to tell you: measured on what data, with what detection method, over what time period. If they cannot answer these questions, the number is marketing, not measurement.

Output Quality Consistency

Quality variance matters as much as average quality. A system that produces 90% quality on average but swings between 60% and 100% depending on the query is less reliable than a system that consistently produces 85% quality. Users cannot trust a system whose quality is unpredictable.

Measure quality consistency as the standard deviation of quality scores over a rolling window. Track it alongside average quality. A stable average with increasing variance is an early warning signal of reliability degradation.
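One way to track this, sketched with an illustrative window size:

```python
import statistics
from collections import deque

class QualityConsistencyTracker:
    """Tracks mean and standard deviation of quality scores over a
    fixed-size rolling window. A stable mean with rising stddev is the
    early warning signal described above. The window size is illustrative;
    size it to cover several days of traffic."""

    def __init__(self, window: int = 500):
        self.scores: deque = deque(maxlen=window)

    def add(self, score: float) -> None:
        self.scores.append(score)

    def summary(self) -> tuple[float, float]:
        """Returns (mean, population stddev) of the current window."""
        mean = statistics.fmean(self.scores)
        stddev = statistics.pstdev(self.scores) if len(self.scores) > 1 else 0.0
        return mean, stddev
```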

Model Drift Rate

Model drift measures how much the model’s behavior changes over time without intentional updates. For API-based models (OpenAI, Anthropic, Google), the provider may update the model behind the same API endpoint. These silent updates can change your system’s behavior in ways your evaluation suite does not cover.

Chen et al. (2023) in “How Is ChatGPT’s Behavior Changing over Time?” documented measurable performance shifts in GPT-3.5 and GPT-4 across a three-month period on the same evaluation tasks. Response formatting, reasoning accuracy, and instruction following all showed statistically significant changes.

Detection: Run a fixed evaluation suite (your “canary tests”) on a daily or weekly cadence. Track scores over time. Statistically significant score changes without intentional system updates indicate model drift. These canary tests should cover your most important use cases and your most fragile edge cases.
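A rough detection sketch: compare today's mean canary score against the baseline using a z-score. This is a stand-in for a proper two-sample test, but with roughly 50 canary cases it works as a first alarm:

```python
import statistics

def drift_alert(baseline_scores: list[float], todays_scores: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flags drift when today's mean canary score sits more than
    z_threshold standard errors from the baseline mean. The threshold
    is an assumption; tune it against your false-alarm tolerance."""
    base_mean = statistics.fmean(baseline_scores)
    base_sd = statistics.stdev(baseline_scores)
    standard_error = base_sd / (len(todays_scores) ** 0.5)
    z = abs(statistics.fmean(todays_scores) - base_mean) / standard_error
    return z > z_threshold
```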

Latency Distribution Stability

Latency spikes in AI systems often correlate with quality changes. When an LLM provider is under load, they may route requests to different hardware or apply different batching strategies, affecting both latency and output quality. A system that suddenly shifts from p99 latency of 2 seconds to p99 of 8 seconds is signaling a change in how requests are being processed.

Track latency distributions (not just averages) and alert on distribution shifts. A bimodal latency distribution (some requests fast, some slow) often indicates inconsistent backend routing that may affect quality.
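A minimal sketch of the p99-shift alert, with the 2x growth ratio as an illustrative threshold:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of latency samples (seconds)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def latency_shift_alert(baseline: list[float], current: list[float],
                        ratio: float = 2.0) -> bool:
    """Alerts when p99 latency grows by more than `ratio` versus baseline,
    e.g. the 2s to 8s shift described above. The ratio is an assumption."""
    return percentile(current, 99) > ratio * percentile(baseline, 99)
```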


Defining SLOs for AI Systems

Service Level Objectives for AI systems must cover both infrastructure and output quality. A system that meets its uptime SLO but fails its quality SLO is not meeting its obligations.

Infrastructure SLOs

Standard SRE practice applies:

  • Availability: 99.9% of requests return a non-error response within the latency SLO.
  • Latency: p50 < 1s, p95 < 3s, p99 < 5s (adjust based on your use case).
  • Error rate: < 0.1% of requests return 5xx errors.

Quality SLOs

These are specific to AI systems:

  • Hallucination rate: < X% of responses contain unsupported claims (measured by automated verification or sampled human review).
  • Relevance rate: > Y% of responses are rated “relevant” or “helpful” by users or automated evaluation.
  • Consistency: Quality score standard deviation < Z over a 7-day rolling window.
  • Drift tolerance: Canary test scores do not deviate more than W% from baseline without an intentional model change.
ai_slo_definition.yaml

```yaml
# AI System SLO Definition
# SLOs must cover both infrastructure and output quality
service: customer-support-agent
version: 2.1

infrastructure_slos:
  availability:
    target: 99.95%
    window: 30d
    measurement: successful_responses / total_requests
  latency:
    p50: 800ms
    p95: 2500ms
    p99: 5000ms  # Latency SLOs must account for LLM provider variance
  error_rate:
    target: "< 0.1%"
    measurement: 5xx_responses / total_requests

# Quality SLOs are what separate AI reliability from traditional SRE
quality_slos:
  hallucination_rate:
    target: "< 3%"
    measurement: sampled_human_review (n=200/week)
    detection_latency: "< 24h"
  relevance_rate:
    target: "> 85%"
    measurement: automated_eval + user_feedback
  consistency:
    target: quality_stddev < 0.08
    window: 7d
  drift_tolerance:
    target: canary_score_delta < 5%
    measurement: daily_canary_suite (n=50 cases)
```

Error Budgets for AI Quality

The error budget concept from SRE applies directly to AI quality SLOs. If your hallucination rate SLO is < 3%, and your measured hallucination rate over the last 30 days is 2.1%, you have consumed 70% of your error budget. This gives you a quantitative basis for decisions: ship a new feature that might increase hallucination risk? Check the error budget first.
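The budget arithmetic is one line:

```python
def quality_budget_consumed(measured_rate_pct: float, slo_target_pct: float) -> float:
    """Fraction of the quality error budget consumed over the window."""
    return measured_rate_pct / slo_target_pct

# The worked example above: 2.1% measured against a < 3% hallucination SLO
consumed = quality_budget_consumed(2.1, 3.0)
print(f"{consumed:.0%} consumed, {1 - consumed:.0%} remaining")  # 70% consumed, 30% remaining
```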

When the quality error budget is exhausted:

  1. Freeze deployments: No model changes, prompt changes, or retrieval pipeline changes until the quality budget recovers.
  2. Increase monitoring: Sample more production traffic for quality evaluation. Reduce the detection latency for quality issues.
  3. Root cause analysis: Identify what consumed the error budget — was it a model provider change, a data quality issue, or a prompt regression?
  4. Remediation: Fix the identified issue and verify that quality metrics recover before resuming normal operations.

The Vendor Evaluation Checklist

Before signing an AI vendor contract, demand answers to these questions:

Infrastructure Questions

What is your historical availability (last 12 months)? What is your MTTR for infrastructure outages? Do you have a status page with incident history? What is your latency distribution (p50/p95/p99)? Do you support multi-region failover?

Quality Questions

How do you measure hallucination rate? What is your model update policy — do you notify before changes? Can I pin a specific model version? Do you provide canary testing infrastructure? What quality SLOs do you offer contractually?

Red flags in vendor responses:

  • “We have 99.99% uptime” with no published incident history or status page. Four nines without evidence is marketing.
  • “Our hallucination rate is less than 1%” without specifying the measurement methodology, dataset, or time period.
  • No model versioning or pinning. If the vendor can change the model behind your API endpoint without notice, your reliability is at their discretion.
  • No quality SLOs in the contract. If quality is not in the SLA, the vendor has no contractual obligation to maintain it.
  • No drift monitoring or canary testing infrastructure. If the vendor does not monitor for drift, they will not detect quality degradation until you report it.

Building Your Own Monitoring

If your vendor does not provide quality monitoring, build it yourself. The minimum viable monitoring stack:

  1. Canary test suite: 50-100 test cases covering your most important use cases. Run daily. Alert on score changes > 5%.
  2. Production sampling: Evaluate 1-5% of production traffic with an automated quality scorer (LLM-as-judge or heuristic). Track scores over time.
  3. Latency monitoring: Track latency distribution (not just average). Alert on distribution shifts.
  4. User feedback loop: Structured feedback collection (thumbs up/down, relevance rating). Correlate with automated quality scores to calibrate your automated evaluation.
  5. Incident classification: When issues occur, classify them as infrastructure (system down) or quality (system up but degraded). Track MTTR separately for each class.
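Step 1 of the stack above might be wired up as follows; `call_model` and `score_response` are placeholders for your model client and evaluator, not any particular library:

```python
from typing import Callable

def run_canary_suite(cases: list[dict],
                     call_model: Callable[[str], str],
                     score_response: Callable[[str, dict], float],
                     baseline_mean: float,
                     alert_delta: float = 0.05) -> tuple[float, bool]:
    """Runs the fixed canary cases, averages their quality scores, and
    flags an alert when the mean drops more than alert_delta (5%, per
    the checklist above) below the stored baseline."""
    scores = [score_response(call_model(case["prompt"]), case) for case in cases]
    mean = sum(scores) / len(scores)
    return mean, (baseline_mean - mean) > alert_delta
```

Schedule this daily from your existing job runner and persist the mean so the baseline reflects a known-good period, not yesterday's possibly drifted run.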

Where Clarity Fits

Clarity’s self-model API adds a user-level reliability dimension. Instead of measuring quality in aggregate, track reliability per user: does the system maintain consistent quality for each individual user over time? User-level reliability metrics catch degradation patterns that aggregate metrics hide — a system might maintain 90% average quality while specific user segments experience 70% quality. The self-model enables per-user SLOs that hold the system accountable at the individual level.


Key Takeaways

  • AI systems require reliability metrics beyond traditional SRE — binary uptime misses quality degradation, which is where most enterprise AI failures occur
  • Five output reliability metrics are essential: hallucination rate, output quality consistency, model drift rate, latency distribution stability, and relevance rate
  • SLOs for AI systems must cover both infrastructure (availability, latency, error rate) and quality (hallucination rate, consistency, drift tolerance)
  • Error budgets for quality work the same way as error budgets for availability — freeze deployments when the budget is exhausted
  • Demand these metrics from vendors before procurement — if they cannot report them, their system is not ready for enterprise use


Key insights

“If your AI vendor cannot tell you their mean time to recovery, they have not measured it. If they have not measured it, they do not have a recovery plan.”


“Availability for AI systems is not just uptime. A system that is up but hallucinating is worse than a system that is down — at least downtime is obvious.”


“Traditional SRE metrics assume binary states: working or broken. AI systems degrade gradually. You need metrics that capture the gradient.”


