This guide shows how to export metrics from Vision Agents to Prometheus for production monitoring. You’ll track latency, token usage, error rates, and more across all agent components.
What You’ll Build
A voice agent with full observability:
- Real-time metrics exposed at /metrics
- LLM latency and token tracking
- STT/TTS performance monitoring
- Turn detection metrics
- Error rate tracking
Prerequisites
- A working Vision Agents setup
- Python 3.10+
- Prometheus (optional, for scraping)
Setup
Install dependencies
uv add opentelemetry-exporter-prometheus prometheus-client
Code
Configure OpenTelemetry before importing any agent modules so metric collection is active from the start. If no meter provider is configured, all metrics are no-ops.
"""Voice agent with Prometheus metrics export."""
# 1. Configure OpenTelemetry
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server
# Start Prometheus HTTP server
PROMETHEUS_PORT = 9464
start_http_server(PROMETHEUS_PORT)
# Configure the meter provider
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
# 2. Now import agent modules
from vision_agents.core import Agent, User, AgentLauncher, Runner
from vision_agents.plugins import deepgram, getstream, gemini, elevenlabs
async def create_agent(**kwargs) -> Agent:
    """Create a voice agent with metrics-enabled components."""
    llm = gemini.LLM("gemini-2.5-flash-lite")
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Metrics Agent", id="agent"),
        instructions=(
            "You're a helpful voice assistant. "
            "Keep responses concise and natural."
        ),
        llm=llm,
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(eager_turn_detection=True),
    )
    return agent
async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join call with metrics collection enabled."""
    # MetricsCollector is automatically attached to the agent
    print(f"Metrics available at: http://localhost:{PROMETHEUS_PORT}/metrics")
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Hello! I'm ready to help.")
        await agent.finish()

    # Print summary after call ends
    m = agent.metrics
    print("\n=== Call Summary ===")
    if m.llm_latency_ms__avg.value():
        print(f"LLM latency: {m.llm_latency_ms__avg.value():.0f} ms avg")
    if m.llm_input_tokens__total.value():
        print(f"Tokens: {m.llm_input_tokens__total.value()} in / {m.llm_output_tokens__total.value()} out")
    if m.tts_characters__total.value():
        print(f"TTS characters: {m.tts_characters__total.value()}")

if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
Running
uv run agent.py --call-type default --call-id test-metrics
Open http://localhost:9464/metrics to see live metrics.
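If Prometheus should scrape these metrics, a minimal scrape config is enough to get started. This is a sketch: the job name and scrape interval are arbitrary choices, and only the port (9464) comes from the code above.
# prometheus.yml: scrape the agent's metrics exporter
scrape_configs:
  - job_name: "voice-agent"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9464"]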
Key Metrics to Monitor
OpenTelemetry metric names use dots (e.g., llm.latency.ms). The Prometheus exporter converts these to underscores (e.g., llm_latency_ms), and counters gain a _total suffix (e.g., llm_errors_total).
Latency Metrics
| Metric | What it measures |
|---|---|
| llm.latency.ms | Time from LLM request to complete response |
| llm.time_to_first_token.ms | Time until streaming begins |
| stt.latency.ms | Speech-to-text processing time |
| tts.latency.ms | Text-to-speech synthesis time |
Usage Metrics
| Metric | What it measures |
|---|---|
| llm.tokens.input | Prompt tokens consumed |
| llm.tokens.output | Completion tokens generated |
| tts.characters | Characters synthesized |
| llm.tool_calls | Function calls executed |
Error Metrics
| Metric | What it measures |
|---|---|
| llm.errors | LLM API errors |
| stt.errors | Transcription failures |
| tts.errors | Synthesis failures |
Example Prometheus Queries
Average LLM latency over time:
rate(llm_latency_ms_sum[5m]) / rate(llm_latency_ms_count[5m])
Total tokens used:
sum(llm_tokens_input) + sum(llm_tokens_output)
LLM error rate (errors per second):
rate(llm_errors_total[5m])
Grafana Dashboard
Create a dashboard with these panels:
- Latency — Line chart showing llm_latency_ms, stt_latency_ms, tts_latency_ms
- Token Usage — Stacked bar of input vs output tokens
- Error Rate — Error count over time
- Active Sessions — Gauge showing realtime_sessions
Production Tips
Add resource attributes for filtering:
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "voice-agent",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)  # register this provider instead of the bare one above
Set up alerting on thresholds like the following; a sample expression for the first is shown after the list:
- LLM latency > 2000ms (p95)
- Error rate > 1%
- Token usage anomalies
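For the latency alert, assuming llm.latency.ms is exported as a histogram (so Prometheus scrapes llm_latency_ms_bucket), the p95 condition can be written as:
histogram_quantile(0.95, sum(rate(llm_latency_ms_bucket[5m])) by (le)) > 2000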
Next Steps