Vision Agents is an open-source Video AI framework for building real-time voice and video applications, built and maintained by the team at Stream. It ships with Stream Video as its default low-latency transport, powered by Stream's global edge network. The framework is edge/transport agnostic, however, so developers can also bring any edge layer they like.

What can you build?

Vision Agents makes it simple to prototype and scale a wide range of AI-powered video apps, including:
  • Coaching & Training — live sports coaching, guided workouts
  • Collaboration — meeting assistants, note-taking, transcription
  • Automation & Robotics — IoT control, surveillance, manufacturing workflows
  • Video AI — video avatars, character agents

Built-in AI integrations

Out of the box, Vision Agents supports popular providers across the AI stack (see the wiring sketch after this list):
  • LLMs: OpenAI, Anthropic, Gemini, xAI
  • Realtime APIs: Gemini (WebSocket), OpenAI (WebRTC)
  • Speech-to-Text (STT): Deepgram, Moonshine, AssemblyAI
  • Text-to-Speech (TTS): ElevenLabs, AssemblyAI, Cartesia, Moonshine
  • Turn / Voice Detection: Fal, Silero, Krisp
  • Audio & Video Processing: YOLO
  • Memory & Context: In-memory, Stream Chat
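
As a rough sketch of how these providers might compose into a single agent (the import paths, class names, and parameters below are assumptions for illustration, not the verified API; the installation guide has the real thing):

```python
# Hypothetical wiring of one provider per layer into an agent.
# All names here are illustrative assumptions, not the verified API.
from vision_agents.core import Agent
from vision_agents.plugins import deepgram, elevenlabs, getstream, openai

agent = Agent(
    edge=getstream.Edge(),     # Stream Video, the default transport
    llm=openai.LLM("gpt-4o"),  # swap in Anthropic, Gemini, or xAI
    stt=deepgram.STT(),        # speech-to-text
    tts=elevenlabs.TTS(),      # text-to-speech
)
```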
Each integration is built on extensible base classes. For example, by extending BaseProcessor or VideoProcessorMixin you can plug in a custom computer-vision model such as Ultralytics YOLO, as in the sketch below.
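
A minimal sketch, assuming the base classes live under vision_agents.core.processors and expose an async per-frame hook (the import path and the process_video method name are assumptions for illustration):

```python
# Sketch of a custom video processor plugging a YOLO pose model into the
# pipeline. BaseProcessor and VideoProcessorMixin are named in the docs;
# the import path and the process_video hook are illustrative assumptions.
from ultralytics import YOLO

from vision_agents.core.processors import BaseProcessor, VideoProcessorMixin


class PoseProcessor(BaseProcessor, VideoProcessorMixin):
    """Runs pose detection on each incoming video frame."""

    def __init__(self):
        super().__init__()
        self.model = YOLO("yolo11n-pose.pt")  # any Ultralytics model works

    async def process_video(self, frame):
        # Run inference on the frame; downstream consumers (e.g. the LLM)
        # can read the detections from the returned results.
        return self.model(frame)
```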
👉 Ready to dive in? Follow the installation guide to build your first Agent.