## Key Considerations
| Factor | Recommendation |
|---|---|
| Region | US East for lowest latency (most AI providers host there by default) |
| CPU vs GPU | CPU for most voice agents; GPU only if running local models |
| Scaling | Use the HTTP Server for multi-session deployments |
## Docker
Two Dockerfiles are provided:

- CPU (`Dockerfile`) - small and fast to build (~150MB)
- GPU (`Dockerfile.gpu`) - for local model inference (~8GB)
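For example, to build each image (the `vision-agents:*` tags below are placeholders, not names the project prescribes):

```bash
# CPU image (default Dockerfile)
docker build -t vision-agents:cpu .

# GPU image
docker build -f Dockerfile.gpu -t vision-agents:gpu .
```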
## Environment Variables
Create a `.env` file with your API keys:
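A sketch of what this might look like; the exact variable names depend on which providers your agent uses (the ones below are illustrative):

```bash
# .env - illustrative keys; use the variables your providers require
OPENAI_API_KEY=sk-...
DEEPGRAM_API_KEY=...
ELEVENLABS_API_KEY=...
```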
## Health Checks
The HTTP Server provides health endpoints:

| Endpoint | Purpose |
|---|---|
| `GET /health` | Liveness probe - is the server running? |
| `GET /ready` | Readiness probe - is the agent warmed up? |
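If you deploy on Kubernetes, these endpoints map directly onto container probes. A minimal sketch, assuming the server listens on port 8000 (adjust to your configuration):

```yaml
# Container spec fragment wiring the two endpoints to k8s probes
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
```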
## Scaling
Vision Agents supports horizontal scaling via the HTTP Server:

- Deploy multiple replicas behind a load balancer
- Each replica handles multiple concurrent sessions
- Sessions are stateful - use sticky sessions or session affinity if needed (see the Service sketch below)
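One way to get session affinity on Kubernetes is `sessionAffinity: ClientIP` on the Service. A sketch, assuming a Deployment labeled `app: vision-agents` and a server port of 8000 (both placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vision-agents
spec:
  selector:
    app: vision-agents          # assumed Deployment label
  ports:
    - port: 80
      targetPort: 8000          # assumed server port
  sessionAffinity: ClientIP     # pin each client IP to one replica
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800     # affinity window (3 hours)
```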
## GPU Deployment
Only use GPU instances if running local models (Roboflow, local VLMs). Most voice agents use cloud APIs and don't need GPUs. When deploying to GPU nodes, make sure the container runtime actually exposes the GPU to the container.
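A minimal `docker run` sketch, assuming an image built from `Dockerfile.gpu` (the image tag and port are placeholders):

```bash
# Expose the host GPUs to the container.
# Requires the NVIDIA Container Toolkit on the host.
docker run --gpus all --env-file .env -p 8000:8000 vision-agents:gpu
```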
## Monitoring

Combine deployment with Telemetry for production visibility:

- Export metrics to Prometheus (a scrape-config sketch follows this list)
- Use `/sessions/{id}/metrics` for per-session debugging
- Set up alerts for error rates and latency spikes
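A Prometheus scrape-job sketch, assuming each replica exposes aggregate metrics at `/metrics` (both the path and the target address are placeholders):

```yaml
scrape_configs:
  - job_name: vision-agents
    metrics_path: /metrics                # assumed metrics path
    static_configs:
      - targets: ["vision-agents:8000"]   # placeholder host:port
```

For one-off debugging of a single session, you can query its metrics directly, e.g. `curl http://localhost:8000/sessions/<session-id>/metrics` (port assumed).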

