Vision Agents is an open-source Video AI framework for building real-time voice and video applications, built and maintained by the team at Stream. It ships with Stream Video as its default low-latency transport, powered by Stream's global edge network. The framework is edge/transport agnostic, however, so developers can also bring any edge layer they like.

What can you build?

Vision Agents makes it simple to prototype and scale a wide range of AI-powered video apps, including:
  • Coaching & Training — live sports coaching, guided workouts
  • Collaboration — meeting assistants, note-taking, transcription
  • Automation & Robotics — IoT control, surveillance, manufacturing workflows
  • Video AI — video avatars, character agents

Built-in AI integrations

Out of the box, Vision Agents supports popular providers across the AI stack (see the wiring sketch after this list):
  • LLMs: OpenAI, Anthropic, Gemini, xAI
  • Realtime APIs: Gemini (WebSocket), OpenAI (WebRTC)
  • Speech-to-Text (STT): Deepgram, Moonshine, AssemblyAI
  • Text-to-Speech (TTS): ElevenLabs, AssemblyAI, Cartesia, Moonshine
  • Turn / Voice Detection: Fal, Silero, Krisp
  • Audio & Video Processing: YOLO
  • Memory & Context: In-memory, Stream Chat
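
As a rough sketch of how these providers might compose into a single agent (the import paths, class names, and parameters below are assumptions for illustration, not the verified API; the installation guide has the real thing):

```python
# Hypothetical wiring of one provider per layer into an agent.
# All names here are illustrative assumptions, not the verified API.
from vision_agents.core import Agent
from vision_agents.plugins import deepgram, elevenlabs, getstream, openai

agent = Agent(
    edge=getstream.Edge(),     # Stream Video, the default transport
    llm=openai.LLM("gpt-4o"),  # swap in Anthropic, Gemini, or xAI
    stt=deepgram.STT(),        # speech-to-text
    tts=elevenlabs.TTS(),      # text-to-speech
)
```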
Each integration is built on extensible base classes. For example, by extending BaseProcessor or VideoProcessorMixin you can plug in a custom computer-vision model such as Ultralytics YOLO, as in the sketch below.
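
A minimal sketch, assuming the base classes live under vision_agents.core.processors and expose an async per-frame hook (the import path and the process_video method name are assumptions for illustration):

```python
# Sketch of a custom video processor plugging a YOLO pose model into the
# pipeline. BaseProcessor and VideoProcessorMixin are named in the docs;
# the import path and the process_video hook are illustrative assumptions.
from ultralytics import YOLO

from vision_agents.core.processors import BaseProcessor, VideoProcessorMixin


class PoseProcessor(BaseProcessor, VideoProcessorMixin):
    """Runs pose detection on each incoming video frame."""

    def __init__(self):
        super().__init__()
        self.model = YOLO("yolo11n-pose.pt")  # any Ultralytics model works

    async def process_video(self, frame):
        # Run inference on the frame; downstream consumers (e.g. the LLM)
        # can read the detections from the returned results.
        return self.model(frame)
```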
👉 Ready to dive in? Follow the installation guide to build your first Agent.