What can you build?
Vision Agents makes it simple to prototype and scale a wide range of AI-powered video apps, including:
- Coaching & Training: live sports coaching, guided workouts
- Collaboration: meeting assistants, note-taking, transcription
- Automation & Robotics: IoT control, surveillance, manufacturing workflows
- Video AI: video avatars, character agents
Get Started
- Installation: Install Vision Agents and set up your first project
- Voice Agents: Build real-time voice agents with AI
- Video Agents: Create AI-powered video applications
- Integrations: Connect with popular AI providers
Built-in AI integrations
Out of the box, Vision Agents supports popular providers across the AI stack:
- LLMs: OpenAI, Anthropic, Gemini, xAI
- Realtime APIs: Gemini (WebSockets), OpenAI (WebRTC)
- Speech-to-Text (STT): Deepgram, Moonshine, AssemblyAI
- Text-to-Speech (TTS): ElevenLabs, AssemblyAI, Cartesia, Moonshine
- Turn / Voice Detection: Fal, Silero, Krisp
- Audio & Video Processing: YOLO
- Memory & Context: In-memory, Stream Chat
With BaseProcessor or VideoProcessorMixin, you can plug in custom computer-vision models such as Ultralytics YOLO.
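
To make the plug-in idea concrete, here is a minimal, self-contained sketch of the general pattern: a base class defines a per-frame hook, and a subclass wraps whatever model you supply. Everything in it is an assumption for illustration. `FrameProcessor`, `ThresholdDetector`, `count_bright_pixels`, and `run_pipeline` are hypothetical names and are not the Vision Agents API; the real BaseProcessor and VideoProcessorMixin interfaces may look quite different, and a toy bright-pixel counter stands in for a real model such as YOLO.

```python
from abc import ABC, abstractmethod
from typing import Callable, List

# Toy grayscale frame: rows of 0-255 pixel values (a real frame would be an image array).
Frame = List[List[int]]


class FrameProcessor(ABC):
    """Hypothetical stand-in for a processor base class (NOT the Vision Agents API)."""

    @abstractmethod
    def process(self, frame: Frame) -> dict:
        """Analyze one frame and return a result dictionary."""


class ThresholdDetector(FrameProcessor):
    """Wraps any callable 'model'; a real project might wrap a YOLO model here."""

    def __init__(self, model: Callable[[Frame], int], min_hits: int = 1) -> None:
        self.model = model
        self.min_hits = min_hits

    def process(self, frame: Frame) -> dict:
        hits = self.model(frame)
        return {"hits": hits, "detected": hits >= self.min_hits}


def count_bright_pixels(frame: Frame, threshold: int = 200) -> int:
    """Toy 'model': count pixels strictly brighter than the threshold."""
    return sum(1 for row in frame for px in row if px > threshold)


def run_pipeline(processors: List[FrameProcessor], frames: List[Frame]) -> List[List[dict]]:
    """Run every processor on every frame; one list of results per frame."""
    return [[p.process(f) for p in processors] for f in frames]


if __name__ == "__main__":
    frames = [[[0, 10], [20, 30]], [[255, 0], [201, 200]]]
    detector = ThresholdDetector(count_bright_pixels, min_hits=2)
    for per_frame in run_pipeline([detector], frames):
        print(per_frame)
```

The design choice worth noting is that the detector receives its model as a plain callable, so swapping the toy counter for a real inference function would not change the pipeline code.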