This guide shows how to export metrics from Vision Agents to Prometheus for production monitoring. You’ll track latency, token usage, error rates, and more across all agent components.

What You’ll Build

A voice agent with full observability:
  • Real-time metrics exposed at /metrics
  • LLM latency and token tracking
  • STT/TTS performance monitoring
  • Turn detection metrics
  • Error rate tracking

Prerequisites

  • A working Vision Agents setup
  • Python 3.10+
  • Prometheus (optional, for scraping)

Setup

Install dependencies
uv add opentelemetry-exporter-prometheus prometheus-client

Code

Configure OpenTelemetry before importing the agent modules so metric collection is enabled. If no meter provider is configured, metrics are no-ops.
"""Voice agent with Prometheus metrics export."""

# 1. Configure OpenTelemetry
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Start Prometheus HTTP server
PROMETHEUS_PORT = 9464
start_http_server(PROMETHEUS_PORT)

# Configure the meter provider
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

# 2. Now import agent modules
from vision_agents.core import Agent, User, AgentLauncher, Runner
from vision_agents.plugins import deepgram, getstream, gemini, elevenlabs


async def create_agent(**kwargs) -> Agent:
    """Create a voice agent with metrics-enabled components."""
    llm = gemini.LLM("gemini-2.5-flash-lite")

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Metrics Agent", id="agent"),
        instructions=(
            "You're a helpful voice assistant. "
            "Keep responses concise and natural."
        ),
        llm=llm,
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(eager_turn_detection=True),
    )

    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join call with metrics collection enabled."""
    # MetricsCollector is automatically attached to the agent
    print(f"Metrics available at: http://localhost:{PROMETHEUS_PORT}/metrics")

    call = await agent.create_call(call_type, call_id)

    async with agent.join(call):
        await agent.simple_response("Hello! I'm ready to help.")
        await agent.finish()

    # Print summary after call ends
    m = agent.metrics
    print("\n=== Call Summary ===")
    if m.llm_latency_ms__avg.value():
        print(f"LLM latency: {m.llm_latency_ms__avg.value():.0f} ms avg")
    if m.llm_input_tokens__total.value():
        print(f"Tokens: {m.llm_input_tokens__total.value()} in / {m.llm_output_tokens__total.value()} out")
    if m.tts_characters__total.value():
        print(f"TTS characters: {m.tts_characters__total.value()}")


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Running

uv run agent.py --call-type default --call-id test-metrics
Open http://localhost:9464/metrics to see live metrics.
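
If you want Prometheus to scrape the endpoint, a minimal scrape configuration would look like the following (the job name is arbitrary; adjust the target if the agent runs on a different host or port):

scrape_configs:
  - job_name: "voice-agent"
    static_configs:
      - targets: ["localhost:9464"]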

Key Metrics to Monitor

OpenTelemetry metric names use dots (e.g., llm.latency.ms). Prometheus converts these to underscores when scraping (e.g., llm_latency_ms).

Latency Metrics

Metric                        What it measures
llm.latency.ms                Time from LLM request to complete response
llm.time_to_first_token.ms    Time until streaming begins
stt.latency.ms                Speech-to-text processing time
tts.latency.ms                Text-to-speech synthesis time

Usage Metrics

Metric               What it measures
llm.tokens.input     Prompt tokens consumed
llm.tokens.output    Completion tokens generated
tts.characters       Characters synthesized
llm.tool_calls       Function calls executed

Error Metrics

Metric        What it measures
llm.errors    LLM API errors
stt.errors    Transcription failures
tts.errors    Synthesis failures

Example Prometheus Queries

Average LLM latency over time:
rate(llm_latency_ms_sum[5m]) / rate(llm_latency_ms_count[5m])
Total tokens used:
sum(llm_tokens_input) + sum(llm_tokens_output)
Error rate:
rate(llm_errors_total[5m])
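p95 LLM latency (a sketch, assuming the latency metric is exported as a histogram with _bucket series):
histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m]))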

Grafana Dashboard

Create a dashboard with these panels (example queries are sketched after the list):
  1. Latency — Line chart showing llm_latency_ms, stt_latency_ms, tts_latency_ms
  2. Token Usage — Stacked bar of input vs output tokens
  3. Error Rate — Error count over time
  4. Active Sessions — Gauge showing realtime_sessions
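
As a starting point, the panel queries can reuse the expressions above. A sketch, using the Prometheus-converted metric names and assuming the sessions gauge is exported as realtime_sessions:

# Latency (one series per component)
rate(llm_latency_ms_sum[5m]) / rate(llm_latency_ms_count[5m])
rate(stt_latency_ms_sum[5m]) / rate(stt_latency_ms_count[5m])
rate(tts_latency_ms_sum[5m]) / rate(tts_latency_ms_count[5m])

# Token usage (stacked input vs output)
increase(llm_tokens_input[5m])
increase(llm_tokens_output[5m])

# Error rate
rate(llm_errors_total[5m])

# Active sessions
realtime_sessions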

Production Tips

Add resource attributes for filtering:
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "voice-agent",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})

provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)

Set up alerting on the following (example rules are sketched after the list):
  • LLM latency > 2000ms (p95)
  • Error rate > 1%
  • Token usage anomalies
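
As a sketch, the latency and error alerts could be expressed as Prometheus alerting rules like these (assuming llm.latency.ms is exported as a histogram with _bucket series; the error alert fires on any sustained errors rather than a true percentage, since no request counter is listed above):

groups:
  - name: voice-agent
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m])) > 2000
        for: 5m
        annotations:
          summary: "p95 LLM latency above 2000 ms"
      - alert: LLMErrors
        expr: rate(llm_errors_total[5m]) > 0
        for: 5m
        annotations:
          summary: "LLM API errors in the last 5 minutes"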

Next Steps