Metrics That Matter: Monitoring AI Model Performance
You've built an AI agent. It's deployed. It's answering questions and processing requests. But how do you know if it's working well? Traditional application monitoring gives you some signals, but AI systems introduce unique challenges that require us to rethink what we measure.
In this post, we'll define the operational metrics that truly matter for LLMs and agentic workflows, grounded in the industry-standard SRE Golden Signals framework.
Why Traditional Metrics Fall Short
When monitoring a typical web service, you might track response times, error rates, and throughput. These metrics work because the relationship between input and output is predictable—a database query takes roughly the same time whether it's your first request or your millionth.
LLMs break this assumption. Consider:
- A simple "Hello" might generate 50 tokens in 500ms
- A complex analysis request might generate 2,000 tokens over 30 seconds
- Both could return HTTP 200, but one costs 40x more
This variability means we need metrics that capture the economics and quality of AI operations, not just their availability.
The SRE Golden Signals: A Quick Refresher
Google's Site Reliability Engineering team defined four Golden Signals that capture the health of any distributed system:
| Signal | Definition | Why It Matters |
|---|---|---|
| Latency | Time to service a request | User experience and SLA compliance |
| Traffic | Demand on your system | Capacity planning and scaling |
| Errors | Rate of failed requests | Reliability and correctness |
| Saturation | How "full" your system is | Predicting capacity limits |
These signals have proven remarkably effective for web services, databases, and microservices. But applying them to AI systems requires translation.
Translating Golden Signals for AI
Latency → Time to Value
For traditional APIs, latency is simple: request in, response out, measure the difference.
For LLMs, latency is multi-dimensional:
| Metric | What It Measures | Why It's Different |
|---|---|---|
| Time to First Token (TTFT) | Perceived responsiveness | Users tolerate long responses if they see progress |
| Generation Speed | Tokens per second | Streaming UX depends on consistent token flow |
| End-to-End Duration | Total job completion | Batch processing and async workflows care about this |
| Tool Execution Time | Latency of function calls | Agentic workflows chain multiple operations |
Key insight: A 10-second response with 200ms TTFT feels faster than a 5-second response that blocks until complete. Measure what matters for your user experience.
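To make this concrete, here's a minimal sketch of how you might capture TTFT, generation speed, and end-to-end duration around a streaming response. The `measure_streaming_latency` wrapper and its print-based reporting are illustrative stand-ins; in practice you'd feed these values into your metrics pipeline (which is exactly what we'll do with OpenTelemetry in Lab 6).

```python
import time

def measure_streaming_latency(stream):
    """Wrap a token stream and report TTFT, generation speed, and total duration.

    `stream` is any iterable that yields tokens (or chunks) as the model
    generates them -- a stand-in for your provider's streaming API.
    """
    start = time.monotonic()
    first_token_at = None
    token_count = 0

    for token in stream:
        if first_token_at is None:
            first_token_at = time.monotonic()  # Time to First Token observed here
        token_count += 1
        yield token  # pass tokens through to the caller unchanged

    end = time.monotonic()
    ttft = (first_token_at or end) - start
    generation_time = end - (first_token_at or end)
    tokens_per_sec = token_count / generation_time if generation_time > 0 else 0.0

    # Sketch only: replace with a histogram/counter in your metrics backend.
    print(f"TTFT={ttft:.3f}s tokens/s={tokens_per_sec:.1f} total={end - start:.3f}s")
```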
Traffic → Token Throughput
"Requests per second" hides the true load on an LLM system. Consider two scenarios:
- 100 requests/sec, each processing 50 tokens = 5,000 tokens/sec
- 10 requests/sec, each processing 5,000 tokens (RAG context) = 50,000 tokens/sec
Scenario 2 has 10x fewer requests but 10x the actual load—and 10x the cost.
Metrics to track:
| Metric | Description | Cost Implication |
|---|---|---|
| Input tokens | Prompt + context size | Direct cost driver |
| Output tokens | Generation length | Usually higher cost per token |
| Context utilisation | % of context window used | Efficiency indicator |
| Requests by operation | Chat vs. embedding vs. function calls | Cost attribution |
Key insight: Token throughput is your direct proxy for cost. If you're not tracking tokens, you're flying blind on expenses.
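Here's a hedged sketch of what token-level instrumentation can look like with the OpenTelemetry metrics API (the same API Lab 6 builds on). The counter names and the `prompt_tokens`/`completion_tokens` usage fields are assumptions modelled on common provider responses; adjust them to your SDK, and note that nothing is exported until an OpenTelemetry SDK and exporter are configured.

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm.agent")

# Metric names below are illustrative choices, not a required convention.
input_tokens = meter.create_counter(
    "llm.tokens.input", unit="token", description="Prompt + context tokens")
output_tokens = meter.create_counter(
    "llm.tokens.output", unit="token", description="Generated tokens")

def record_usage(usage: dict, operation: str, model: str) -> None:
    """Record token usage from a provider response's usage block.

    `usage` is assumed to carry prompt_tokens / completion_tokens, as many
    provider SDKs do; map the field names to whatever yours returns.
    """
    attrs = {"operation": operation, "model": model}  # enables cost attribution
    input_tokens.add(usage["prompt_tokens"], attributes=attrs)
    output_tokens.add(usage["completion_tokens"], attributes=attrs)
```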
Errors → Multi-Layer Failure Modes
A 200 OK from your LLM provider doesn't mean your agent worked correctly. AI systems have layered failure modes:
| Error Type | Example | Detection Method |
|---|---|---|
| Provider Errors | Rate limits (429), server errors (500) | HTTP status codes |
| Model Errors | Context overflow, invalid requests | API error responses |
| Quality Errors | Hallucinations, irrelevant responses | Evaluation metrics |
| Logic Errors | Malformed JSON, failed tool calls | Schema validation |
| Safety Errors | Content policy violations, refusals | Response filtering |
Key insight: Track error rates at each layer. A 99.9% provider uptime means nothing if 15% of responses fail quality checks.
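One way to operationalise layered error tracking is a small classifier that buckets each call into a layer before it hits your dashboards. The function below is a sketch: `schema_validator` and `judge_score` are hypothetical hooks into your own validation and evaluation steps, and the thresholds are illustrative.

```python
import json

def classify_failure(status_code: int, response_text: str,
                     schema_validator=None, judge_score=None) -> str:
    """Map a single LLM call outcome onto the error layers above.

    `schema_validator` is an optional callable that checks parsed output
    against your schema; `judge_score` is an optional 0-1 relevance score
    from an evaluator. Both are placeholders for your own tooling.
    """
    if status_code in (429, 500, 502, 503):
        return "provider_error"       # rate limits, server errors
    if status_code == 400:
        return "model_error"          # e.g. context overflow, invalid request
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return "logic_error"          # malformed JSON
    if schema_validator and not schema_validator(parsed):
        return "logic_error"          # failed schema validation
    if judge_score is not None and judge_score < 0.5:
        return "quality_error"        # low relevance per evaluation metric
    return "ok"
```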
Saturation → Resource Boundaries
Traditional saturation metrics focus on CPU, memory, and disk. LLMs have different constraints:
| Resource | Saturation Signal | Impact When Exceeded |
|---|---|---|
| Context window | % of max tokens used | Truncation, lost context |
| Rate limits | Requests/tokens per minute | Throttling, queue delays |
| Concurrent requests | Active connections | Provider-side queuing |
| Budget limits | Daily/monthly spend | Service degradation |
Key insight: Context window saturation is particularly insidious—your system doesn't error, it silently drops information. Monitor it proactively.
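A sketch of proactive context monitoring follows. The `reserve_for_output` headroom and the 90% warning threshold are assumptions to tune per model; the point is to compute saturation before sending the request, not after truncation has already dropped information.

```python
def context_saturation(prompt_tokens: int, max_context_tokens: int,
                       reserve_for_output: int = 1024) -> float:
    """Return how 'full' the context window is, leaving headroom for output.

    `reserve_for_output` is an assumed budget for the model's reply; tune it
    per model. Values close to 1.0 mean imminent truncation risk.
    """
    usable = max_context_tokens - reserve_for_output
    return prompt_tokens / usable if usable > 0 else 1.0

# Example: warn before the window silently drops information.
if context_saturation(prompt_tokens=115_000, max_context_tokens=128_000) > 0.9:
    print("Context window above 90% -- consider summarising or trimming history")
```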
The Three Dimensions: Cost, Quality, Performance
Beyond the Golden Signals, AI systems require tracking three interconnected dimensions:
Cost Metrics
| Metric | Formula | Target |
|---|---|---|
| Cost per request | (Input tokens × input price) + (output tokens × output price) | Baseline for budgeting |
| Cost per successful outcome | Total cost ÷ successful completions | True unit economics |
| Cost efficiency | Output value ÷ input cost | ROI on AI investment |
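As a sketch of the first two rows, the helpers below price a single request and divide total spend by successful completions. The per-1,000-token prices are placeholders; plug in your provider's actual rates, ideally split by model.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one request; prices are per 1,000 tokens and are placeholders."""
    return (
        (prompt_tokens / 1000) * input_price_per_1k
        + (completion_tokens / 1000) * output_price_per_1k
    )

def cost_per_successful_outcome(total_cost: float, successful_completions: int) -> float:
    """True unit economics: spend divided by outcomes that actually succeeded."""
    if successful_completions == 0:
        return float("inf")  # all spend, no value delivered
    return total_cost / successful_completions
```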
Quality Metrics
| Metric | Measurement Approach | Threshold |
|---|---|---|
| Response relevance | LLM-as-judge, human eval | Task-specific |
| Factual accuracy | Ground truth comparison | Critical for RAG |
| Format compliance | Schema validation rate | 100% for structured output |
| Consistency | Response variance across runs | Application-dependent |
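Format compliance is the easiest of these to automate. The sketch below approximates it with a JSON-parse-plus-required-keys check; a real pipeline would likely use jsonschema or Pydantic, but the metric of interest is the same either way: the compliance rate over time.

```python
import json

def format_compliance_rate(responses: list[str], required_keys: set[str]) -> float:
    """Share of responses that parse as JSON and contain the required keys.

    A lightweight stand-in for full schema validation; track the rate as a
    time series rather than alerting on any single failure.
    """
    def is_valid(text: str) -> bool:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and required_keys.issubset(data)

    if not responses:
        return 1.0  # no traffic, nothing out of compliance
    return sum(is_valid(r) for r in responses) / len(responses)
```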
Performance Metrics
| Metric | SLI Definition | Typical SLO |
|---|---|---|
| Availability | Successful requests ÷ total requests | 99.9% |
| Latency P50 | Median response time | < 2 seconds |
| Latency P95 | 95th percentile response time | < 10 seconds |
| Throughput | Tokens processed per minute | Capacity-dependent |
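These SLIs fall out of raw observations with a few lines of standard-library Python. The sketch below assumes you already collect per-request durations and an error count; it needs at least a couple of data points for the percentile maths to be meaningful.

```python
import statistics

def performance_slis(latencies_s: list[float], error_count: int) -> dict:
    """Compute availability and latency percentiles from raw observations.

    `latencies_s` holds per-request durations in seconds for all requests,
    successful or not; `error_count` is how many of them failed.
    """
    total = len(latencies_s)
    if total < 2:
        raise ValueError("need at least two observations for percentiles")
    cut_points = statistics.quantiles(latencies_s, n=100)  # 99 cut points
    return {
        "availability": (total - error_count) / total,
        "latency_p50": statistics.median(latencies_s),
        "latency_p95": cut_points[94],  # 95th percentile
    }
```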
Building Your Metrics Strategy
Start with SLIs and SLOs
Define Service Level Indicators (SLIs) that map to user experience:
- SLI: Proportion of chat requests that return a valid response in under 5 seconds
- SLO: 99% of requests meet this criterion over a 30-day rolling window
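Below is a minimal sketch of evaluating that SLI/SLO pair over a rolling window, assuming each request is logged as a (timestamp, duration, was_valid) tuple with timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone

def slo_compliance(events: list[tuple[datetime, float, bool]],
                   threshold_s: float = 5.0,
                   window: timedelta = timedelta(days=30)) -> float:
    """Fraction of recent requests meeting the SLI: a valid response in under 5s.

    Each event is (timestamp, duration_seconds, was_valid); compare the result
    against the 99% SLO target.
    """
    cutoff = datetime.now(timezone.utc) - window
    recent = [(duration, ok) for (ts, duration, ok) in events if ts >= cutoff]
    if not recent:
        return 1.0  # no traffic in the window, nothing violated
    good = sum(1 for (duration, ok) in recent if ok and duration < threshold_s)
    return good / len(recent)
```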
Create Dashboards for Each Dimension
- Operations Dashboard: Golden Signals (latency, traffic, errors, saturation)
- Cost Dashboard: Token usage, spend rate, cost per operation
- Quality Dashboard: Evaluation scores, failure modes, trends
Set Alerts That Matter
| Alert | Condition | Severity |
|---|---|---|
| Error rate spike | > 5% errors in 5 minutes | Page |
| Latency degradation | P95 > 2x baseline | Warning |
| Cost anomaly | Daily spend > 150% average | Warning |
| Context saturation | > 90% window usage | Info |
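The same thresholds can live next to the metrics as code rather than only in a runbook. The sketch below mirrors the table above; all inputs and cut-offs are meant to be tuned to your baselines.

```python
def evaluate_alerts(error_rate: float, p95_s: float, baseline_p95_s: float,
                    daily_spend: float, avg_daily_spend: float,
                    context_utilisation: float) -> list[tuple[str, str]]:
    """Evaluate the alert conditions from the table; thresholds are illustrative."""
    alerts = []
    if error_rate > 0.05:
        alerts.append(("page", "Error rate spike: > 5% errors in 5 minutes"))
    if p95_s > 2 * baseline_p95_s:
        alerts.append(("warning", "Latency degradation: P95 above 2x baseline"))
    if daily_spend > 1.5 * avg_daily_spend:
        alerts.append(("warning", "Cost anomaly: daily spend > 150% of average"))
    if context_utilisation > 0.9:
        alerts.append(("info", "Context saturation: > 90% window usage"))
    return alerts
```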
Conclusion
Monitoring AI systems requires expanding our observability toolkit beyond traditional application metrics. By grounding our approach in the SRE Golden Signals—Latency, Traffic, Errors, and Saturation—we create a framework that's both rigorous and familiar to operations teams.
The key translations to remember:
- Latency → Time to First Token + Generation Speed
- Traffic → Token Throughput (your cost proxy)
- Errors → Provider failures + Quality failures + Logic failures
- Saturation → Context window usage + Rate limits
Combined with explicit tracking of Cost, Quality, and Performance dimensions, you'll have the visibility needed to operate AI systems reliably—and economically.
In Lab 6, we'll get hands-on and implement these metrics using OpenTelemetry, building custom meters that capture the Golden Signals for an LLM agent.