Metrics That Matter: Monitoring AI Model Performance
You've built an AI agent. It's deployed. It's answering questions and processing requests. But how do you know if it's working well? Traditional application monitoring gives you some signals, but AI systems introduce unique challenges that require us to rethink what we measure.
In this post, we'll define the operational metrics that truly matter for LLMs and agentic workflows, grounded in the industry-standard SRE Golden Signals framework.
Why Traditional Metrics Fall Short
When monitoring a typical web service, you might track response times, error rates, and throughput. These metrics work because the relationship between input and output is predictable—a database query takes roughly the same time whether it's your first request or your millionth.
LLMs break this assumption. Consider:
- A simple "Hello" might generate 50 tokens in 500ms
- A complex analysis request might generate 2,000 tokens over 30 seconds
- Both could return HTTP 200, but one costs 40x more
This variability means we need metrics that capture the economics and quality of AI operations, not just their availability.
The SRE Golden Signals: A Quick Refresher
Google's Site Reliability Engineering team defined four Golden Signals that capture the health of any distributed system:
| Signal | Definition | Why It Matters |
|---|---|---|
| Latency | Time to service a request | User experience and SLA compliance |
| Traffic | Demand on your system | Capacity planning and scaling |
| Errors | Rate of failed requests | Reliability and correctness |
| Saturation | How "full" your system is | Predicting capacity limits |
These signals have proven remarkably effective for web services, databases, and microservices. But applying them to AI systems requires translation.
Translating Golden Signals for AI
Latency → Time to Value
For traditional APIs, latency is simple: request in, response out, measure the difference.
For LLMs, latency is multi-dimensional:
| Metric | What It Measures | Why It's Different |
|---|---|---|
| Time to First Token (TTFT) | Perceived responsiveness | Users tolerate long responses if they see progress |
| Generation Speed | Tokens per second | Streaming UX depends on consistent token flow |
| End-to-End Duration | Total job completion | Batch processing and async workflows care about this |
| Tool Execution Time | Latency of function calls | Agentic workflows chain multiple operations |
Key insight: A 10-second response with 200ms TTFT feels faster than a 5-second response that blocks until complete. Measure what matters for your user experience.
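To make this concrete, here's a minimal sketch of how you might capture TTFT, generation speed, and end-to-end duration around a streaming response. The `measure_streaming_latency` wrapper and its print-based reporting are illustrative stand-ins; in practice you'd feed these values into your metrics pipeline (which is exactly what we'll do with OpenTelemetry in Lab 6).

```python
import time

def measure_streaming_latency(stream):
    """Wrap a token stream and report TTFT, generation speed, and total duration.

    `stream` is any iterable that yields tokens (or chunks) as the model
    generates them -- a stand-in for your provider's streaming API.
    """
    start = time.monotonic()
    first_token_at = None
    token_count = 0

    for token in stream:
        if first_token_at is None:
            first_token_at = time.monotonic()  # Time to First Token observed here
        token_count += 1
        yield token  # pass tokens through to the caller unchanged

    end = time.monotonic()
    ttft = (first_token_at or end) - start
    generation_time = end - (first_token_at or end)
    tokens_per_sec = token_count / generation_time if generation_time > 0 else 0.0

    # Sketch only: replace with a histogram/counter in your metrics backend.
    print(f"TTFT={ttft:.3f}s tokens/s={tokens_per_sec:.1f} total={end - start:.3f}s")
```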
Traffic → Token Throughput
"Requests per second" hides the true load on an LLM system. Consider two scenarios:
- 100 requests/sec, each processing 50 tokens = 5,000 tokens/sec
- 10 requests/sec, each processing 5,000 tokens (RAG context) = 50,000 tokens/sec
Scenario 2 has 10x fewer requests but 10x the actual load—and 10x the cost.
Metrics to track:
| Metric | Description | Cost Implication |
|---|---|---|
| Input tokens | Prompt + context size | Direct cost driver |
| Output tokens | Generation length | Usually higher cost per token |
| Context utilisation | % of context window used | Efficiency indicator |
| Requests by operation | Chat vs. embedding vs. function calls | Cost attribution |
Key insight: Token throughput is your direct proxy for cost. If you're not tracking tokens, you're flying blind on expenses.
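Here's a hedged sketch of what token-level instrumentation can look like with the OpenTelemetry metrics API (the same API Lab 6 builds on). The counter names and the `prompt_tokens`/`completion_tokens` usage fields are assumptions modelled on common provider responses; adjust them to your SDK, and note that nothing is exported until an OpenTelemetry SDK and exporter are configured.

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm.agent")

# Metric names below are illustrative choices, not a required convention.
input_tokens = meter.create_counter(
    "llm.tokens.input", unit="token", description="Prompt + context tokens")
output_tokens = meter.create_counter(
    "llm.tokens.output", unit="token", description="Generated tokens")

def record_usage(usage: dict, operation: str, model: str) -> None:
    """Record token usage from a provider response's usage block.

    `usage` is assumed to carry prompt_tokens / completion_tokens, as many
    provider SDKs do; map the field names to whatever yours returns.
    """
    attrs = {"operation": operation, "model": model}  # enables cost attribution
    input_tokens.add(usage["prompt_tokens"], attributes=attrs)
    output_tokens.add(usage["completion_tokens"], attributes=attrs)
```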
Errors → Multi-Layer Failure Modes
A 200 OK from your LLM provider doesn't mean your agent worked correctly. AI systems have layered failure modes:
| Error Type | Example | Detection Method |
|---|---|---|
| Provider Errors | Rate limits (429), server errors (500) | HTTP status codes |
| Model Errors | Context overflow, invalid requests | API error responses |
| Quality Errors | Hallucinations, irrelevant responses | Evaluation metrics |
| Logic Errors | Malformed JSON, failed tool calls | Schema validation |
| Safety Errors | Content policy violations, refusals | Response filtering |
Key insight: Track error rates at each layer. A 99.9% provider uptime means nothing if 15% of responses fail quality checks.
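One way to operationalise layered error tracking is a small classifier that buckets each call into a layer before it hits your dashboards. The function below is a sketch: `schema_validator` and `judge_score` are hypothetical hooks into your own validation and evaluation steps, and the thresholds are illustrative.

```python
import json

def classify_failure(status_code: int, response_text: str,
                     schema_validator=None, judge_score=None) -> str:
    """Map a single LLM call outcome onto the error layers above.

    `schema_validator` is an optional callable that checks parsed output
    against your schema; `judge_score` is an optional 0-1 relevance score
    from an evaluator. Both are placeholders for your own tooling.
    """
    if status_code in (429, 500, 502, 503):
        return "provider_error"       # rate limits, server errors
    if status_code == 400:
        return "model_error"          # e.g. context overflow, invalid request
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return "logic_error"          # malformed JSON
    if schema_validator and not schema_validator(parsed):
        return "logic_error"          # failed schema validation
    if judge_score is not None and judge_score < 0.5:
        return "quality_error"        # low relevance per evaluation metric
    return "ok"
```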
Saturation → Resource Boundaries
Traditional saturation metrics focus on CPU, memory, and disk. LLMs have different constraints:
| Resource | Saturation Signal | Impact When Exceeded |
|---|---|---|
| Context window | % of max tokens used | Truncation, lost context |
| Rate limits | Requests/tokens per minute | Throttling, queue delays |
| Concurrent requests | Active connections | Provider-side queuing |
| Budget limits | Daily/monthly spend | Service degradation |
Key insight: Context window saturation is particularly insidious—your system doesn't error, it silently drops information. Monitor it proactively.
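A sketch of proactive context monitoring follows. The `reserve_for_output` headroom and the 90% warning threshold are assumptions to tune per model; the point is to compute saturation before sending the request, not after truncation has already dropped information.

```python
def context_saturation(prompt_tokens: int, max_context_tokens: int,
                       reserve_for_output: int = 1024) -> float:
    """Return how 'full' the context window is, leaving headroom for output.

    `reserve_for_output` is an assumed budget for the model's reply; tune it
    per model. Values close to 1.0 mean imminent truncation risk.
    """
    usable = max_context_tokens - reserve_for_output
    return prompt_tokens / usable if usable > 0 else 1.0

# Example: warn before the window silently drops information.
if context_saturation(prompt_tokens=115_000, max_context_tokens=128_000) > 0.9:
    print("Context window above 90% -- consider summarising or trimming history")
```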
The Three Dimensions: Cost, Quality, Performance
Beyond the Golden Signals, AI systems require tracking three interconnected dimensions:
Cost Metrics
| Metric | Formula | Target |
|---|---|---|
| Cost per request | (Input tokens × input price) + (output tokens × output price) | Baseline for budgeting |
| Cost per successful outcome | Total cost ÷ successful completions | True unit economics |
| Cost efficiency | Output value ÷ input cost | ROI on AI investment |
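As a sketch of the first two rows, the helpers below price a single request and divide total spend by successful completions. The per-1,000-token prices are placeholders; plug in your provider's actual rates, ideally split by model.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one request; prices are per 1,000 tokens and are placeholders."""
    return (
        (prompt_tokens / 1000) * input_price_per_1k
        + (completion_tokens / 1000) * output_price_per_1k
    )

def cost_per_successful_outcome(total_cost: float, successful_completions: int) -> float:
    """True unit economics: spend divided by outcomes that actually succeeded."""
    if successful_completions == 0:
        return float("inf")  # all spend, no value delivered
    return total_cost / successful_completions
```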
Quality Metrics
| Metric | Measurement Approach | Threshold |
|---|---|---|
| Response relevance | LLM-as-judge, human eval | Task-specific |
| Factual accuracy | Ground truth comparison | Critical for RAG |
| Format compliance | Schema validation rate | 100% for structured output |
| Consistency | Response variance across runs | Application-dependent |
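Format compliance is the easiest of these to automate. The sketch below approximates it with a JSON-parse-plus-required-keys check; a real pipeline would likely use jsonschema or Pydantic, but the metric of interest is the same either way: the compliance rate over time.

```python
import json

def format_compliance_rate(responses: list[str], required_keys: set[str]) -> float:
    """Share of responses that parse as JSON and contain the required keys.

    A lightweight stand-in for full schema validation; track the rate as a
    time series rather than alerting on any single failure.
    """
    def is_valid(text: str) -> bool:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and required_keys.issubset(data)

    if not responses:
        return 1.0  # no traffic, nothing out of compliance
    return sum(is_valid(r) for r in responses) / len(responses)
```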
Performance Metrics
| Metric | SLI Definition | Typical SLO |
|---|---|---|
| Availability | Successful requests ÷ total requests | 99.9% |
| Latency P50 | Median response time | < 2 seconds |
| Latency P95 | 95th percentile response time | < 10 seconds |
| Throughput | Tokens processed per minute | Capacity-dependent |
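These SLIs fall out of raw observations with a few lines of standard-library Python. The sketch below assumes you already collect per-request durations and an error count; it needs at least a couple of data points for the percentile maths to be meaningful.

```python
import statistics

def performance_slis(latencies_s: list[float], error_count: int) -> dict:
    """Compute availability and latency percentiles from raw observations.

    `latencies_s` holds per-request durations in seconds for all requests,
    successful or not; `error_count` is how many of them failed.
    """
    total = len(latencies_s)
    if total < 2:
        raise ValueError("need at least two observations for percentiles")
    cut_points = statistics.quantiles(latencies_s, n=100)  # 99 cut points
    return {
        "availability": (total - error_count) / total,
        "latency_p50": statistics.median(latencies_s),
        "latency_p95": cut_points[94],  # 95th percentile
    }
```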
Building Your Metrics Strategy
Start with SLIs and SLOs
Define Service Level Indicators (SLIs) that map to user experience:
- SLI: Proportion of chat requests that return a valid response in under 5 seconds
- SLO: 99% of requests meet this criterion over a 30-day rolling window
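Below is a minimal sketch of evaluating that SLI/SLO pair over a rolling window, assuming each request is logged as a (timestamp, duration, was_valid) tuple with timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone

def slo_compliance(events: list[tuple[datetime, float, bool]],
                   threshold_s: float = 5.0,
                   window: timedelta = timedelta(days=30)) -> float:
    """Fraction of recent requests meeting the SLI: a valid response in under 5s.

    Each event is (timestamp, duration_seconds, was_valid); compare the result
    against the 99% SLO target.
    """
    cutoff = datetime.now(timezone.utc) - window
    recent = [(duration, ok) for (ts, duration, ok) in events if ts >= cutoff]
    if not recent:
        return 1.0  # no traffic in the window, nothing violated
    good = sum(1 for (duration, ok) in recent if ok and duration < threshold_s)
    return good / len(recent)
```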
Create Dashboards for Each Dimension
- Operations Dashboard: Golden Signals (latency, traffic, errors, saturation)
- Cost Dashboard: Token usage, spend rate, cost per operation
- Quality Dashboard: Evaluation scores, failure modes, trends
Set Alerts That Matter
| Alert | Condition | Severity |
|---|---|---|
| Error rate spike | > 5% errors in 5 minutes | Page |
| Latency degradation | P95 > 2x baseline | Warning |
| Cost anomaly | Daily spend > 150% average | Warning |
| Context saturation | > 90% window usage | Info |
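The same thresholds can live next to the metrics as code rather than only in a runbook. The sketch below mirrors the table above; all inputs and cut-offs are meant to be tuned to your baselines.

```python
def evaluate_alerts(error_rate: float, p95_s: float, baseline_p95_s: float,
                    daily_spend: float, avg_daily_spend: float,
                    context_utilisation: float) -> list[tuple[str, str]]:
    """Evaluate the alert conditions from the table; thresholds are illustrative."""
    alerts = []
    if error_rate > 0.05:
        alerts.append(("page", "Error rate spike: > 5% errors in 5 minutes"))
    if p95_s > 2 * baseline_p95_s:
        alerts.append(("warning", "Latency degradation: P95 above 2x baseline"))
    if daily_spend > 1.5 * avg_daily_spend:
        alerts.append(("warning", "Cost anomaly: daily spend > 150% of average"))
    if context_utilisation > 0.9:
        alerts.append(("info", "Context saturation: > 90% window usage"))
    return alerts
```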
Conclusion
Monitoring AI systems requires expanding our observability toolkit beyond traditional application metrics. By grounding our approach in the SRE Golden Signals—Latency, Traffic, Errors, and Saturation—we create a framework that's both rigorous and familiar to operations teams.
The key translations to remember:
- Latency → Time to First Token + Generation Speed
- Traffic → Token Throughput (your cost proxy)
- Errors → Provider failures + Quality failures + Logic failures
- Saturation → Context window usage + Rate limits
Combined with explicit tracking of Cost, Quality, and Performance dimensions, you'll have the visibility needed to operate AI systems reliably—and economically.
In Lab 6, we'll get hands-on and implement these metrics using OpenTelemetry, building custom meters that capture the Golden Signals for an LLM agent.