Deploy Vision Agents to production using Docker and Kubernetes. For a complete implementation with Helm charts, see the Deploy example.

Key Considerations

Region - US East for lowest latency (most AI providers default here)
CPU vs GPU - CPU for most voice agents; GPU only if running local models
Scaling - Use the HTTP Server for multi-session deployments

Docker

Two Dockerfiles are provided.

CPU (Dockerfile) - small and fast to build (~150MB):
FROM python:3.13-slim
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock agent.py ./
EXPOSE 8080
ENV UV_LINK_MODE=copy
CMD ["sh", "-c", "uv sync --frozen && uv run agent.py serve --host 0.0.0.0 --port 8080"]
GPU (Dockerfile.gpu) - for local model inference (~8GB):
FROM pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock agent.py ./
EXPOSE 8080
ENV UV_LINK_MODE=copy
CMD ["sh", "-c", "uv sync --frozen && uv run agent.py serve --host 0.0.0.0 --port 8080"]
Build for Linux (required for cloud deployment):
docker buildx build --platform linux/amd64 -t vision-agent .
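To test the image locally before deploying, you can run it with your .env file (a minimal sketch; it assumes the image is tagged vision-agent, listens on port 8080, and matches your local platform):
docker run --rm -p 8080:8080 --env-file .env vision-agent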

Environment Variables

Create a .env file with your API keys:
STREAM_API_KEY=your_key
STREAM_API_SECRET=your_secret
DEEPGRAM_API_KEY=your_key
ELEVENLABS_API_KEY=your_key
GOOGLE_API_KEY=your_key
For Kubernetes, create a secret:
kubectl create secret generic vision-agent-env --from-env-file=.env
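The secret can then be injected into the agent pods, for example via envFrom in the Deployment spec (a sketch; the container name and image tag are placeholders):
spec:
  template:
    spec:
      containers:
        - name: vision-agent
          image: vision-agent:latest
          envFrom:
            - secretRef:
                name: vision-agent-env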

Health Checks

The HTTP Server provides health endpoints:
GET /health - liveness probe: is the server running?
GET /ready - readiness probe: is the agent warmed up?
Kubernetes probe configuration:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 30
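You can also hit the endpoints directly to verify a running instance (assuming the server is reachable on port 8080, as configured above):
curl http://localhost:8080/health
curl http://localhost:8080/ready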

Scaling

Vision Agents supports horizontal scaling via the HTTP Server:
  1. Deploy multiple replicas behind a load balancer
  2. Each replica handles multiple concurrent sessions
  3. Sessions are stateful; use sticky sessions or session affinity if needed (see the Service sketch below)
A typical Deployment spec:
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
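For session affinity at the Kubernetes layer, one option is ClientIP affinity on the Service in front of the replicas (a sketch; names and ports are placeholders for your setup):
apiVersion: v1
kind: Service
metadata:
  name: vision-agent
spec:
  selector:
    app: vision-agent
  ports:
    - port: 80
      targetPort: 8080
  sessionAffinity: ClientIP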

GPU Deployment

Only use GPU instances if running local models (Roboflow, local VLMs). Most voice agents use cloud APIs and don’t need GPUs. When deploying to GPU nodes:
resources:
  limits:
    nvidia.com/gpu: 1
Ensure CUDA drivers are installed and the GPU Dockerfile matches your CUDA version.
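Depending on your cluster, you may also need to schedule pods onto GPU nodes explicitly, for example with a node selector and toleration (a sketch; the label and taint keys depend on how your GPU nodes are provisioned):
nodeSelector:
  nvidia.com/gpu.present: "true"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule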

Monitoring

Combine deployment with Telemetry for production visibility:
  1. Export metrics to Prometheus
  2. Use /sessions/{id}/metrics for per-session debugging (example below)
  3. Set up alerts for error rates and latency spikes
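For example, to pull metrics for a single session while debugging (assuming the HTTP Server is reachable on port 8080; substitute a real session id):
curl http://localhost:8080/sessions/<session-id>/metrics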

Next Steps