Lab 6: The Golden Signals of LLM Operations
In Lab 5, we turned the lights on. We instrumented our agent with OpenTelemetry and visualised the execution traces in .NET Aspire. We can see what happened.
But in a production system, "seeing what happened" isn't enough. You need to know if the system is healthy. In traditional software engineering, we rely on Google's SRE Golden Signals: Latency, Traffic, Errors, and Saturation.
Do these apply to Stochastic Parrots? Yes, but they require translation. In this lab, we will define the operational dimensions of an LLM Agent and implement custom metrics to track them.
The Translation Layer
An LLM Agent is effectively a web service, but its resource consumption profile is vastly different. Here is how the Golden Signals translate from a standard Web API to an LLM Agent.
1. Latency → Time to First Token (TTFT) & Generation Speed
For a standard API, latency is a single number: request in, response out. For an LLM, latency is more nuanced. A 10-second response might be perfectly acceptable if the user sees the first word within 200 ms (streaming).
- The Metric: We track TTFT (perceived latency) separately from End-to-End Duration (job completion), as sketched below.
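A minimal sketch of capturing both numbers; token_stream is assumed to be any iterable of text chunks from your provider's streaming API:

import time
from typing import Iterable

def measure_stream(token_stream: Iterable[str]):
    """Return the full text, time to first token (ms), and end-to-end duration (ms)."""
    start = time.perf_counter()
    ttft_ms = None
    chunks = []
    for chunk in token_stream:
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000  # perceived latency
        chunks.append(chunk)
    total_ms = (time.perf_counter() - start) * 1000  # job completion
    return "".join(chunks), ttft_ms, total_ms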
2. Traffic → Token Throughput
"Requests per second" is a useful metric, but it hides the true load. One request might process 50 tokens; another might process 50,000 (RAG context).
- The Metric: We track Input Tokens (Load) and Output Tokens (Generation work) separately. This is also your direct proxy for Cost.
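Because token counts map directly to spend, a toy cost estimate is worth keeping next to these counters. The per-million-token prices below are made up for illustration; substitute your provider's actual price sheet.

# Illustrative prices only (USD per million tokens); use your provider's real rates.
PRICE_PER_M_INPUT = 0.10
PRICE_PER_M_OUTPUT = 0.40

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# A RAG-heavy request: 50,000 input tokens dwarf the 500 generated tokens
print(estimate_cost_usd(50_000, 500))  # ~0.0052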
3. Errors → Provider Failures vs. Logic Failures
A 200 OK from OpenAI doesn't mean your agent worked.
- The Metric: We track standard HTTP errors (500s, 429s) but also "Logic Errors"—for example, if the model refuses to answer or returns malformed JSON when you asked for structured output.
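As a sketch, logic failures can be classified before they ever reach a dashboard. The refusal phrases below are illustrative placeholders; the returned category can later be attached as an attribute on an error counter, alongside the 429s and 500s.

import json

def classify_logic_failure(raw_text: str) -> str | None:
    """Return an error category for responses that succeeded at the HTTP level."""
    refusal_markers = ("i can't help with", "i cannot assist")  # illustrative only
    if any(marker in raw_text.lower() for marker in refusal_markers):
        return "refusal"
    try:
        json.loads(raw_text)  # we asked for structured output
    except json.JSONDecodeError:
        return "malformed_json"
    return None  # no logic failure detected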
4. Saturation → Context Window & Rate Limits
In web services, saturation might be CPU or Memory. In LLMs, saturation is often the Context Window.
- The Metric: Percentage of Context Window used. If your RAG retrieval consistently fills 95% of the model's context window, you are dangerously close to losing information (truncation).
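A back-of-the-envelope check you can run per request is sketched below; the 128,000-token window is an assumption for illustration, so look up your model's actual limit.

# Assumed context limit for illustration; check your model's documentation.
CONTEXT_WINDOW_TOKENS = 128_000

def context_window_utilization(prompt_tokens: int) -> float:
    """Fraction of the model's context window consumed by the prompt."""
    return prompt_tokens / CONTEXT_WINDOW_TOKENS

# A 121,600-token RAG prompt sits at 95%: truncation territory
print(f"{context_window_utilization(121_600):.0%}")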
Implementing the Signals
Let's upgrade our previous OTel setup. Auto-instrumentation gives us traces, but to capture these Golden Signals we need to define Meters of our own.
We will add a MeterProvider, a custom counter for token usage, and a histogram for request duration.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
# 1. Setup Metric Reader (Console exporter for the demo; OTLP to Aspire in production)
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
# 2. Create a Meter
meter = metrics.get_meter("llm.agent.operational")
# 3. Define our Golden Signal Counters
token_counter = meter.create_counter(
"llm_token_usage",
description="Counts the number of tokens used",
unit="token"
)
latency_histogram = meter.create_histogram(
"llm_request_duration",
description="Distribution of LLM request durations",
unit="ms"
)
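With the instruments defined, recording looks roughly like the sketch below. call_llm() is a stand-in for whatever provider SDK you use, and its usage object is assumed to expose prompt and completion token counts; the attribute names match the sample output shown later.

import time

def generate_with_metrics(model: str, prompt: str):
    start = time.perf_counter()
    response = call_llm(model=model, prompt=prompt)  # placeholder for your provider's SDK
    latency_histogram.record((time.perf_counter() - start) * 1000, {"model": model})
    usage = response.usage  # assumed shape: prompt_tokens / completion_tokens
    token_counter.add(usage.prompt_tokens, {"model": model, "type": "prompt"})
    token_counter.add(usage.completion_tokens, {"model": model, "type": "completion"})
    token_counter.add(usage.prompt_tokens + usage.completion_tokens, {"model": model, "type": "total"})
    return response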
Viewing in Aspire
When you wire this up to your Aspire dashboard (using the OTLP Exporter instead of Console), you won't just see a list of traces anymore.
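The swap itself is small. This sketch assumes the opentelemetry-exporter-otlp package is installed and that OTEL_EXPORTER_OTLP_ENDPOINT points at the Aspire dashboard's OTLP endpoint (Aspire typically injects this variable for resources it launches):

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT when no endpoint is passed explicitly
reader = PeriodicExportingMetricReader(OTLPMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))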
You can now navigate to the Metrics tab and plot:
- Cost Velocity: llm_token_usage over time (grouped by type).
- Performance Stability: llm_request_duration (P95 vs P50 latency).
This allows you to answer the hard questions: "Did our latest prompt engineering change make the agent slower?" or "Why did our costs double on Tuesday?"
For reference, here is the ConsoleMetricExporter output for a single request, containing the same data points the Aspire charts are built from:
{
"resource_metrics": [
{
"resource": {
"attributes": {
"service.name": "otel-llm"
},
"schema_url": ""
},
"scope_metrics": [
{
"scope": {
"name": "llm.agent.operational",
"version": "",
"schema_url": "",
"attributes": null
},
"metrics": [
{
"name": "llm_request_duration",
"description": "Distribution of LLM request durations",
"unit": "ms",
"data": {
"data_points": [
{
"attributes": {
"model": "gemini-2.0-flash"
},
"start_time_unix_nano": 1765504349659287676,
"time_unix_nano": 1765504349659745718,
"count": 1,
"sum": 1145.103931427002,
"bucket_counts": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
0,
0,
0,
0
],
"explicit_bounds": [
0.0,
5.0,
10.0,
25.0,
50.0,
75.0,
100.0,
250.0,
500.0,
750.0,
1000.0,
2500.0,
5000.0,
7500.0,
10000.0
],
"min": 1145.103931427002,
"max": 1145.103931427002,
"exemplars": []
}
],
"aggregation_temporality": 2
}
},
{
"name": "llm_token_usage",
"description": "Counts the number of tokens used",
"unit": "token",
"data": {
"data_points": [
{
"attributes": {
"model": "gemini-2.0-flash",
"type": "total"
},
"start_time_unix_nano": 1765504349659381801,
"time_unix_nano": 1765504349659745718,
"value": 39,
"exemplars": []
},
{
"attributes": {
"model": "gemini-2.0-flash",
"type": "prompt"
},
"start_time_unix_nano": 1765504349659402885,
"time_unix_nano": 1765504349659745718,
"value": 7,
"exemplars": []
},
{
"attributes": {
"model": "gemini-2.0-flash",
"type": "completion"
},
"start_time_unix_nano": 1765504349659415301,
"time_unix_nano": 1765504349659745718,
"value": 32,
"exemplars": []
}
],
"aggregation_temporality": 2,
"is_monotonic": true
}
}
],
"schema_url": ""
}
],
"schema_url": ""
}
]
}
Observability is about more than debugging errors; it's about understanding the economics and performance of your system. By translating the Golden Signals to LLM terms, we treat our Agent not as a magic box, but as a reliable service.