Deploy Vision Agents to production using Docker and Kubernetes. For a complete implementation with Helm charts, see the Deploy example.

Key Considerations

Region - US East for lowest latency (most AI providers default here)
CPU vs GPU - CPU for most voice agents; GPU only if running local models
Scaling - Use the HTTP Server for multi-session deployments

Docker

Two Dockerfiles are provided.

CPU (Dockerfile) - small and fast to build (~150MB):
FROM python:3.13-slim
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock agent.py ./
EXPOSE 8080
ENV UV_LINK_MODE=copy
CMD ["sh", "-c", "uv sync --frozen && uv run agent.py serve --host 0.0.0.0 --port 8080"]
GPU (Dockerfile.gpu) - for local model inference (~8GB):
FROM pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock agent.py ./
EXPOSE 8080
ENV UV_LINK_MODE=copy
CMD ["sh", "-c", "uv sync --frozen && uv run agent.py serve --host 0.0.0.0 --port 8080"]
Build for Linux (required for cloud deployment):
docker buildx build --platform linux/amd64 -t vision-agent .
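To test the image locally before deploying, you can run it with your .env file (a minimal sketch; it assumes the image is tagged vision-agent, listens on port 8080, and matches your local platform):
docker run --rm -p 8080:8080 --env-file .env vision-agent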

Environment Variables

Create a .env file with your API keys:
STREAM_API_KEY=your_key
STREAM_API_SECRET=your_secret
DEEPGRAM_API_KEY=your_key
ELEVENLABS_API_KEY=your_key
GOOGLE_API_KEY=your_key
For Kubernetes, create a secret:
kubectl create secret generic vision-agent-env --from-env-file=.env
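The secret can then be injected into the agent pods, for example via envFrom in the Deployment spec (a sketch; the container name and image tag are placeholders):
spec:
  template:
    spec:
      containers:
        - name: vision-agent
          image: vision-agent:latest
          envFrom:
            - secretRef:
                name: vision-agent-env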

Health Checks

The HTTP Server provides health endpoints:
GET /health - liveness probe: is the server running?
GET /ready - readiness probe: is the agent warmed up?
Kubernetes probe configuration:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 30
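You can also hit the endpoints directly to verify a running instance (assuming the server is reachable on port 8080, as configured above):
curl http://localhost:8080/health
curl http://localhost:8080/ready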

Scaling

Vision Agents supports horizontal scaling via the HTTP Server:
  1. Deploy multiple replicas behind a load balancer
  2. Each replica handles multiple concurrent sessions
  3. Sessions are stateful; use sticky sessions or session affinity if needed (see the Service sketch below)
A typical Deployment spec:
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
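For session affinity at the Kubernetes layer, one option is ClientIP affinity on the Service in front of the replicas (a sketch; names and ports are placeholders for your setup):
apiVersion: v1
kind: Service
metadata:
  name: vision-agent
spec:
  selector:
    app: vision-agent
  ports:
    - port: 80
      targetPort: 8080
  sessionAffinity: ClientIP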

GPU Deployment

Only use GPU instances if running local models (Roboflow, local VLMs). Most voice agents use cloud APIs and don’t need GPUs. When deploying to GPU nodes:
resources:
  limits:
    nvidia.com/gpu: 1
Ensure CUDA drivers are installed and the GPU Dockerfile matches your CUDA version.
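Depending on your cluster, you may also need to schedule pods onto GPU nodes explicitly, for example with a node selector and toleration (a sketch; the label and taint keys depend on how your GPU nodes are provisioned):
nodeSelector:
  nvidia.com/gpu.present: "true"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule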

Monitoring

Combine deployment with Telemetry for production visibility:
  1. Export metrics to Prometheus
  2. Use /sessions/{id}/metrics for per-session debugging (example below)
  3. Set up alerts for error rates and latency spikes
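For example, to pull metrics for a single session while debugging (assuming the HTTP Server is reachable on port 8080; substitute a real session id):
curl http://localhost:8080/sessions/<session-id>/metrics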

Next Steps