Metrics That Matter: Monitoring AI Model Performance

You've built an AI agent. It's deployed. It's answering questions and processing requests. But how do you know if it's working well? Traditional application monitoring gives you some signals, but AI systems introduce unique challenges that require us to rethink what we measure.

In this post, we'll define the operational metrics that truly matter for LLMs and agentic workflows, grounded in the industry-standard SRE Golden Signals framework.

Why Traditional Metrics Fall Short

When monitoring a typical web service, you might track response times, error rates, and throughput. These metrics work because the relationship between input and output is predictable—a database query takes roughly the same time whether it's your first request or your millionth.

LLMs break this assumption. Consider:

  • A simple "Hello" might generate 50 tokens in 500ms
  • A complex analysis request might generate 2,000 tokens over 30 seconds
  • Both could return HTTP 200, but one costs 40x more

This variability means we need metrics that capture the economics and quality of AI operations, not just their availability.
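To make the economics concrete, here is a minimal sketch of that comparison. The per-token price is a made-up placeholder, not any provider's actual rate; the point is the ratio, not the absolute numbers.

```python
# Illustrative only: the token counts come from the example above and the
# price is a hypothetical placeholder, not a real provider's rate.
PRICE_PER_OUTPUT_TOKEN = 0.00002  # hypothetical $/token


def response_cost(output_tokens: int, price: float = PRICE_PER_OUTPUT_TOKEN) -> float:
    """Rough cost of a single response based on output tokens alone."""
    return output_tokens * price


greeting = response_cost(50)      # the simple "Hello" case
analysis = response_cost(2_000)   # the long analysis case

print(f"Greeting: ${greeting:.4f}")
print(f"Analysis: ${analysis:.4f} ({analysis / greeting:.0f}x more)")
```

Both calls would look identical in a dashboard that only counts requests and HTTP statuses.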

The SRE Golden Signals: A Quick Refresher

Google's Site Reliability Engineering team defined four Golden Signals that capture the health of any distributed system:

  • Latency: time to service a request. Matters for user experience and SLA compliance.
  • Traffic: demand on your system. Matters for capacity planning and scaling.
  • Errors: rate of failed requests. Matters for reliability and correctness.
  • Saturation: how "full" your system is. Matters for predicting capacity limits.

These signals have proven remarkably effective for web services, databases, and microservices. But applying them to AI systems requires translation.

Translating Golden Signals for AI

Latency → Time to Value

For traditional APIs, latency is simple: request in, response out, measure the difference.

For LLMs, latency is multi-dimensional:

  • Time to First Token (TTFT): perceived responsiveness. Users tolerate long responses if they can see progress.
  • Generation Speed: tokens per second. Streaming UX depends on a consistent token flow.
  • End-to-End Duration: total job completion time. This is what batch processing and async workflows care about.
  • Tool Execution Time: latency of function calls. Agentic workflows chain multiple operations, so these add up.

Key insight: A 10-second response with 200ms TTFT feels faster than a 5-second response that blocks until complete. Measure what matters for your user experience.
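To capture these dimensions in practice, you can wrap whatever streaming interface your provider exposes and time the stream yourself. The sketch below assumes a generic iterator of tokens; the function name and return shape are illustrative, not a specific SDK's API.

```python
import time
from typing import Iterable


def measure_streaming_latency(token_stream: Iterable[str]) -> dict:
    """Wrap any token iterator and capture TTFT, generation speed, and duration.

    `token_stream` stands in for whatever streaming interface your provider
    exposes (an SSE iterator, an SDK generator, etc.).
    """
    start = time.monotonic()
    first_token_at = None
    token_count = 0

    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.monotonic()  # time to first token
        token_count += 1

    end = time.monotonic()
    generation_time = (end - first_token_at) if first_token_at else 0.0

    return {
        "time_to_first_token_s": (first_token_at - start) if first_token_at else None,
        "tokens_per_second": token_count / generation_time if generation_time > 0 else None,
        "end_to_end_duration_s": end - start,
        "output_tokens": token_count,
    }
```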

Traffic → Token Throughput

"Requests per second" hides the true load on an LLM system. Consider two scenarios:

  1. 100 requests/sec, each processing 50 tokens = 5,000 tokens/sec
  2. 10 requests/sec, each processing 5,000 tokens (RAG context) = 50,000 tokens/sec

Scenario 2 has 10x fewer requests but 10x the actual load—and 10x the cost.

Metrics to track:

  • Input tokens: prompt plus context size. The direct cost driver.
  • Output tokens: generation length. Usually a higher cost per token.
  • Context utilisation: percentage of the context window used. An efficiency indicator.
  • Requests by operation: chat vs. embedding vs. function calls. Enables cost attribution.

Key insight: Token throughput is your direct proxy for cost. If you're not tracking tokens, you're flying blind on expenses.
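A minimal sketch of tracking token throughput by operation type, assuming your client already reports prompt and completion token counts. In production these tallies would be emitted as metrics rather than held in process memory.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenLedger:
    """In-memory tally of token throughput, keyed by operation type."""
    input_tokens: dict = field(default_factory=lambda: defaultdict(int))
    output_tokens: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, operation: str, prompt_tokens: int, completion_tokens: int) -> None:
        self.input_tokens[operation] += prompt_tokens
        self.output_tokens[operation] += completion_tokens

    def totals(self) -> dict:
        return {
            "input": sum(self.input_tokens.values()),
            "output": sum(self.output_tokens.values()),
        }


ledger = TokenLedger()
ledger.record("chat", prompt_tokens=5_000, completion_tokens=400)  # RAG-style request
ledger.record("chat", prompt_tokens=50, completion_tokens=60)      # simple greeting
print(ledger.totals())
```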

Errors → Multi-Layer Failure Modes

A 200 OK from your LLM provider doesn't mean your agent worked correctly. AI systems have layered failure modes:

  • Provider errors: rate limits (429), server errors (500). Detected via HTTP status codes.
  • Model errors: context overflow, invalid requests. Detected via API error responses.
  • Quality errors: hallucinations, irrelevant responses. Detected via evaluation metrics.
  • Logic errors: malformed JSON, failed tool calls. Detected via schema validation.
  • Safety errors: content policy violations, refusals. Detected via response filtering.

Key insight: Track error rates at each layer. A 99.9% provider uptime means nothing if 15% of responses fail quality checks.
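A rough sketch of classifying a single call into these layers, covering only the checks that are cheap to run inline (status codes and response shape); quality and safety failures typically need separate evaluation pipelines or content filters. The category labels and field names are illustrative.

```python
import json


def classify_failure(status_code: int, body: str, required_keys: set[str]) -> str | None:
    """Assign one LLM call to a failure layer, or return None if no failure is detected."""
    # Provider / model layers: surfaced directly by the HTTP status.
    if status_code == 429:
        return "provider_error:rate_limited"
    if status_code >= 500:
        return "provider_error:server"
    if status_code >= 400:
        return "model_error:invalid_request"

    # Logic layer: the call "succeeded" but the output is unusable.
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "logic_error:malformed_json"
    if not isinstance(payload, dict) or not required_keys.issubset(payload):
        return "logic_error:missing_fields"

    return None


print(classify_failure(200, '{"answer": "42"}', {"answer"}))      # None
print(classify_failure(200, "Sure! Here you go...", {"answer"}))  # logic_error:malformed_json
```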

Saturation → Resource Boundaries

Traditional saturation metrics focus on CPU, memory, and disk. LLMs have different constraints:

  • Context window: watch the % of max tokens used. Exceeding it means truncation and lost context.
  • Rate limits: watch requests and tokens per minute. Exceeding them means throttling and queue delays.
  • Concurrent requests: watch active connections. Exceeding them means provider-side queuing.
  • Budget limits: watch daily and monthly spend. Exceeding them means service degradation.

Key insight: Context window saturation is particularly insidious—your system doesn't error, it silently drops information. Monitor it proactively.
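A small sketch of a proactive check, assuming you know your model's documented context limit and can count prompt tokens before sending; the 90% warning threshold is an arbitrary starting point, not a standard.

```python
def context_saturation(prompt_tokens: int, max_context_tokens: int,
                       warn_at: float = 0.9) -> tuple[float, bool]:
    """Return (utilisation, should_warn) for a single request."""
    utilisation = prompt_tokens / max_context_tokens
    return utilisation, utilisation >= warn_at


# Hypothetical numbers: a 128k-token model with a very full prompt.
ratio, warn = context_saturation(prompt_tokens=118_000, max_context_tokens=128_000)
if warn:
    print(f"Context window at {ratio:.0%}: older turns or retrieved chunks may be dropped")
```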

The Three Dimensions: Cost, Quality, Performance

Beyond the Golden Signals, AI systems require tracking three interconnected dimensions:

Cost Metrics

  • Cost per request: (input tokens × input price) + (output tokens × output price). Your baseline for budgeting.
  • Cost per successful outcome: total cost ÷ successful completions. Your true unit economics.
  • Cost efficiency: output value ÷ input cost. The ROI on your AI investment.
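The sketch below turns the first two formulas into code. The prices, token counts, and success count are hypothetical values for illustration only.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request, with separate per-token prices for input and output."""
    return input_tokens * input_price + output_tokens * output_price


def cost_per_successful_outcome(total_cost: float, successful_completions: int) -> float:
    """True unit economics: spend divided by outcomes that actually helped a user."""
    if successful_completions == 0:
        return float("inf")
    return total_cost / successful_completions


# Hypothetical prices ($/token) and volumes.
costs = [request_cost(4_000, 500, 0.000003, 0.000015) for _ in range(1_000)]
print(cost_per_successful_outcome(sum(costs), successful_completions=940))
```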

Quality Metrics

  • Response relevance: measured via LLM-as-judge or human evaluation. Thresholds are task-specific.
  • Factual accuracy: measured against ground truth. Critical for RAG.
  • Format compliance: measured as the schema validation rate. Target 100% for structured output.
  • Consistency: measured as response variance across runs. Thresholds are application-dependent.
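Of these, format compliance is the easiest to automate. A minimal sketch, assuming the structured output is expected to be a JSON object with known keys:

```python
import json


def format_compliance_rate(responses: list[str], required_keys: set[str]) -> float:
    """Fraction of responses that parse as JSON objects containing the expected keys."""
    def compliant(raw: str) -> bool:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            return False
        return isinstance(payload, dict) and required_keys.issubset(payload)

    return sum(compliant(r) for r in responses) / len(responses) if responses else 0.0


samples = ['{"verdict": "approve", "reason": "ok"}', 'I think the answer is approve.']
print(format_compliance_rate(samples, {"verdict", "reason"}))  # 0.5
```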

Performance Metrics

  • Availability: successful requests ÷ total requests. Typical SLO: 99.9%.
  • Latency P50: median response time. Typical SLO: under 2 seconds.
  • Latency P95: 95th-percentile response time. Typical SLO: under 10 seconds.
  • Throughput: tokens processed per minute. The SLO is capacity-dependent.

Building Your Metrics Strategy

Start with SLIs and SLOs

Define Service Level Indicators (SLIs) that map to user experience:

SLI: Proportion of chat requests that return 
     a valid response in under 5 seconds

SLO: 99% of requests meet this criterion 
     over a 30-day rolling window
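Computing that SLI over a window of request records might look like the following sketch; the record shape and threshold mirror the example above, and how you decide a response is "valid" is application-specific.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float
    valid_response: bool


def sli_fast_and_valid(records: list[RequestRecord], latency_threshold_s: float = 5.0) -> float:
    """Share of requests that returned a valid response within the latency threshold."""
    if not records:
        return 1.0
    good = sum(1 for r in records if r.valid_response and r.latency_s < latency_threshold_s)
    return good / len(records)


window = [RequestRecord(1.2, True), RequestRecord(7.8, True), RequestRecord(2.1, False)]
print(f"SLI: {sli_fast_and_valid(window):.2%} (SLO target: 99% over a 30-day window)")
```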

Create Dashboards for Each Dimension

  1. Operations Dashboard: Golden Signals (latency, traffic, errors, saturation)
  2. Cost Dashboard: Token usage, spend rate, cost per operation
  3. Quality Dashboard: Evaluation scores, failure modes, trends

Set Alerts That Matter

  • Error rate spike: more than 5% errors in 5 minutes. Severity: page.
  • Latency degradation: P95 above 2x baseline. Severity: warning.
  • Cost anomaly: daily spend above 150% of the average. Severity: warning.
  • Context saturation: above 90% window usage. Severity: info.
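In practice these rules live in your alerting system (Prometheus rules, CloudWatch alarms, and so on) rather than in application code, but as a sketch of the logic, evaluating the conditions above against current readings might look like this; the thresholds come from the list above and the input values are illustrative.

```python
def check_alerts(error_rate_5m: float, p95_latency_s: float, baseline_p95_s: float,
                 daily_spend: float, avg_daily_spend: float,
                 context_utilisation: float) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for any alert conditions that currently fire."""
    alerts = []
    if error_rate_5m > 0.05:
        alerts.append(("page", f"Error rate {error_rate_5m:.1%} over the last 5 minutes"))
    if p95_latency_s > 2 * baseline_p95_s:
        alerts.append(("warning", f"P95 latency {p95_latency_s:.1f}s vs baseline {baseline_p95_s:.1f}s"))
    if daily_spend > 1.5 * avg_daily_spend:
        alerts.append(("warning", f"Daily spend ${daily_spend:.2f} exceeds 150% of average"))
    if context_utilisation > 0.9:
        alerts.append(("info", f"Context window at {context_utilisation:.0%}"))
    return alerts


for severity, message in check_alerts(0.07, 9.5, 4.0, 180.0, 100.0, 0.93):
    print(severity.upper(), message)
```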

Conclusion

Monitoring AI systems requires expanding our observability toolkit beyond traditional application metrics. By grounding our approach in the SRE Golden Signals—Latency, Traffic, Errors, and Saturation—we create a framework that's both rigorous and familiar to operations teams.

The key translations to remember:

  • Latency → Time to First Token + Generation Speed
  • Traffic → Token Throughput (your cost proxy)
  • Errors → Provider failures + Quality failures + Logic failures
  • Saturation → Context window usage + Rate limits

Combined with explicit tracking of Cost, Quality, and Performance dimensions, you'll have the visibility needed to operate AI systems reliably—and economically.

In Lab 6, we'll get hands-on and implement these metrics using OpenTelemetry, building custom meters that capture the Golden Signals for an LLM agent.
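As a small preview, and not the Lab 6 implementation itself, recording token and latency metrics with the OpenTelemetry Python metrics API might look roughly like this. The metric and attribute names are illustrative rather than the official GenAI semantic conventions, and a configured MeterProvider and exporter are assumed to exist elsewhere.

```python
from opentelemetry import metrics

# Assumes an OpenTelemetry SDK MeterProvider and exporter are configured elsewhere;
# the metric and attribute names here are illustrative choices.
meter = metrics.get_meter("llm.agent")

token_counter = meter.create_counter(
    "llm.tokens", unit="token",
    description="Tokens processed, split by direction and operation",
)
latency_histogram = meter.create_histogram(
    "llm.request.duration", unit="s",
    description="End-to-end request duration",
)

# After each LLM call (example values):
prompt_tokens, completion_tokens, duration_s = 4_000, 350, 2.7
token_counter.add(prompt_tokens, {"direction": "input", "operation": "chat"})
token_counter.add(completion_tokens, {"direction": "output", "operation": "chat"})
latency_histogram.record(duration_s, {"operation": "chat"})
```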