This guide shows how to export metrics from Vision Agents to Prometheus for production monitoring. You’ll track latency, token usage, error rates, and more across all agent components.
What You’ll Build
A voice agent with full observability:
- Real-time metrics exposed at /metrics
- LLM latency and token tracking
- STT/TTS performance monitoring
- Turn detection metrics
- Error rate tracking
Prerequisites
- A working Vision Agents setup
- Python 3.10+
- Prometheus (optional, for scraping)
Setup
Install dependencies
uv add opentelemetry-exporter-prometheus prometheus-client
Code
Configure OpenTelemetry before importing any agent modules so metric collection is active from the start. If no meter provider is configured, all metrics are no-ops.
"""Voice agent with Prometheus metrics export."""
# 1. Configure OpenTelemetry
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server
# Start Prometheus HTTP server
PROMETHEUS_PORT = 9464
start_http_server(PROMETHEUS_PORT)
# Configure the meter provider
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
# 2. Now import agent modules
from vision_agents.core import Agent, User, AgentLauncher, Runner
from vision_agents.plugins import deepgram, getstream, gemini, elevenlabs
async def create_agent(**kwargs) -> Agent:
    """Create a voice agent with metrics-enabled components."""
    llm = gemini.LLM("gemini-2.5-flash-lite")
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Metrics Agent", id="agent"),
        instructions=(
            "You're a helpful voice assistant. "
            "Keep responses concise and natural."
        ),
        llm=llm,
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(eager_turn_detection=True),
    )
    return agent
async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join call with metrics collection enabled."""
    # MetricsCollector is automatically attached to the agent
    print(f"Metrics available at: http://localhost:{PROMETHEUS_PORT}/metrics")
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Hello! I'm ready to help.")
        await agent.finish()

    # Print summary after call ends
    m = agent.metrics
    print("\n=== Call Summary ===")
    if m.llm_latency_ms__avg.value():
        print(f"LLM latency: {m.llm_latency_ms__avg.value():.0f} ms avg")
    if m.llm_input_tokens__total.value():
        print(f"Tokens: {m.llm_input_tokens__total.value()} in / {m.llm_output_tokens__total.value()} out")
    if m.tts_characters__total.value():
        print(f"TTS characters: {m.tts_characters__total.value()}")

if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
Running
uv run agent.py --call-type default --call-id test-metrics
Open http://localhost:9464/metrics to see live metrics.
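If Prometheus should scrape these metrics, a minimal scrape config is enough to get started. This is a sketch: the job name and scrape interval are arbitrary choices, and only the port (9464) comes from the code above.
# prometheus.yml: scrape the agent's metrics exporter
scrape_configs:
  - job_name: "voice-agent"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9464"]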
Key Metrics to Monitor
OpenTelemetry metric names use dots (e.g., llm.latency.ms). The Prometheus exporter converts these to underscores (e.g., llm_latency_ms), and counters gain a _total suffix (e.g., llm_errors_total).
Latency Metrics
| Metric | What it measures |
|---|---|
| llm.latency.ms | Time from LLM request to complete response |
| llm.time_to_first_token.ms | Time until streaming begins |
| stt.latency.ms | Speech-to-text processing time |
| tts.latency.ms | Text-to-speech synthesis time |
Usage Metrics
| Metric | What it measures |
|---|---|
| llm.tokens.input | Prompt tokens consumed |
| llm.tokens.output | Completion tokens generated |
| tts.characters | Characters synthesized |
| llm.tool_calls | Function calls executed |
Error Metrics
| Metric | What it measures |
|---|---|
| llm.errors | LLM API errors |
| stt.errors | Transcription failures |
| tts.errors | Synthesis failures |
Example Prometheus Queries
Average LLM latency over time:
rate(llm_latency_ms_sum[5m]) / rate(llm_latency_ms_count[5m])
Total tokens used:
sum(llm_tokens_input) + sum(llm_tokens_output)
LLM error rate (errors per second):
rate(llm_errors_total[5m])
Grafana Dashboard
Create a dashboard with these panels:
- Latency — Line chart showing llm_latency_ms, stt_latency_ms, tts_latency_ms
- Token Usage — Stacked bar of input vs output tokens
- Error Rate — Error count over time
- Active Sessions — Gauge showing realtime_sessions
Production Tips
Add resource attributes for filtering:
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "voice-agent",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)  # register this provider instead of the bare one above
Set up alerting on thresholds like the following; a sample expression for the first is shown after the list:
- LLM latency > 2000ms (p95)
- Error rate > 1%
- Token usage anomalies
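For the latency alert, assuming llm.latency.ms is exported as a histogram (so Prometheus scrapes llm_latency_ms_bucket), the p95 condition can be written as:
histogram_quantile(0.95, sum(rate(llm_latency_ms_bucket[5m])) by (le)) > 2000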
Next Steps