Build low-latency voice and video AI agents using any model. Vision Agents is an open-source Python framework with 25+ integrations, production-ready deployment, and Stream’s global edge network for sub-500ms latency.

Get Started

Install and build your first agent

GitHub

Star the project and explore examples

X Account

Follow us for updates

What You Can Build

Voice Agents

Customer support bots, phone assistants, and voice interfaces using OpenAI Realtime, Gemini, or STT + LLM + TTS pipelines.

Video AI

Sports coaching, surveillance, and manufacturing workflows. Combine YOLO, Roboflow, or Moondream with Gemini or OpenAI vision.

Phone Integration

Inbound and outbound calling via Twilio. Build phone bots with RAG-powered knowledge bases.

Video Avatars

Real-time interactive avatars with HeyGen or video style transfer with Decart.

Examples

| Example | Description |
| --- | --- |
| Simple Voice Agent | Basic voice agent with OpenAI or Gemini Realtime |
| Golf Coach | YOLO pose detection + Gemini for real-time coaching |
| Phone + RAG | Twilio calling with TurboPuffer vector search |
| Security Camera | Face recognition, package detection, automated alerts |

Capabilities

  • 25+ integrations — OpenAI, Gemini, Anthropic, Deepgram, ElevenLabs, YOLO, and more
  • Two modes — Realtime APIs (WebRTC/WebSocket) or custom STT → LLM → TTS pipelines
  • Video processing — Run YOLO, Roboflow, or custom models on every frame
  • Phone support — Twilio integration for voice calls with bi-directional audio
  • RAG — TurboPuffer vector search and Gemini FileSearch for knowledge retrieval
  • Production ready — HTTP server, Prometheus metrics, Docker deployment with GPU support
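
To make the "custom STT → LLM → TTS pipeline" mode above concrete, here is a minimal, framework-agnostic sketch of how three stages might be chained. It does not use the Vision Agents API: `VoicePipeline`, `respond`, and the `fake_*` stage functions are hypothetical names introduced only for illustration, and the stubs stand in for real speech-to-text, language model, and text-to-speech providers.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; these are NOT part of the Vision Agents API.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    """Chains three independent stages: audio -> text -> reply -> audio."""

    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # 1. transcribe the caller's audio
        reply = self.llm(transcript)     # 2. generate a text reply
        return self.tts(reply)           # 3. synthesize the reply as audio


# Stub stages so the sketch runs without any provider account.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
print(pipeline.respond(b"hello"))  # b'You said: hello'
```

Because each stage is just a callable, any one provider can be swapped without touching the other two, which is the main reason to choose a pipeline over a single Realtime API. The trade-off is that three sequential hops typically add latency compared with one end-to-end model.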

Next Steps

Installation

Install the SDK and configure your providers

Integrations

Browse 25+ supported AI providers

Guides

Deploy to production with Docker and metrics

Try Stream Video

Get 333,000 free participant minutes