Vision Agents uses Stream Video for real-time WebRTC transport by default. External WebRTC transports are supported as well. Most AI providers offer free tiers to get started.
Installation
Quick Start
How it works
Unlike frame-by-frame VLMs, Pegasus buffers recent frames from the watched video track, encodes them into a short MP4 clip, uploads it to the TwelveLabs Assets API, and analyzes it with your prompt. The streamed answer is spoken by your agent’s TTS. Pegasus works well for questions about recent activity: “What did they just do?”, “Did anything fall?”, “Describe the last few seconds.”
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| api_key | str | None | API key (defaults to TWELVELABS_API_KEY env var) |
| model_name | str | "pegasus1.5" | Pegasus model identifier |
| fps | float | 1.0 | Frame sampling rate for the buffered clip |
| clip_seconds | int | 5 | Clip length analyzed per request (minimum 4) |
| max_tokens | int | 512 | Maximum response tokens (minimum 512) |
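To make the relationship between `fps` and `clip_seconds` concrete, here is a minimal sketch of a rolling frame buffer. It is an illustration only: `RollingClipBuffer`, `offer`, `is_full`, and `snapshot` are hypothetical names and not part of the plugin API, and the plugin's real buffering may differ. The sketch assumes a clip holds roughly `fps * clip_seconds` frames and that older frames are dropped as new ones arrive.

```python
from collections import deque
from typing import Any, List, Optional


class RollingClipBuffer:
    """Hypothetical helper (not the plugin's API): keeps the newest sampled frames."""

    def __init__(self, fps: float = 1.0, clip_seconds: int = 5) -> None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        if clip_seconds < 4:
            # Mirrors the documented minimum for clip_seconds.
            raise ValueError("clip_seconds must be at least 4")
        self.interval = 1.0 / fps  # minimum spacing between kept frames, in seconds
        # Assumption: one clip holds about fps * clip_seconds frames.
        self._frames: deque = deque(maxlen=max(1, round(fps * clip_seconds)))
        self._last_kept: Optional[float] = None

    def offer(self, timestamp: float, frame: Any) -> bool:
        """Keep the frame only if at least 1/fps seconds passed since the last kept one."""
        if self._last_kept is not None and timestamp - self._last_kept < self.interval:
            return False
        self._last_kept = timestamp
        self._frames.append(frame)  # deque(maxlen=...) evicts the oldest frame when full
        return True

    def is_full(self) -> bool:
        """True once enough frames are buffered for a full-length clip."""
        return len(self._frames) == self._frames.maxlen

    def snapshot(self) -> List[Any]:
        """Return the buffered frames, oldest first, ready to be encoded."""
        return list(self._frames)


# Simulate a camera delivering a frame every 0.5 s; integers stand in for frames.
buffer = RollingClipBuffer(fps=1.0, clip_seconds=5)
for i in range(20):
    buffer.offer(timestamp=i * 0.5, frame=i)
print(buffer.is_full(), buffer.snapshot())  # True [10, 12, 14, 16, 18]
```

Under these assumptions, raising either `fps` or `clip_seconds` increases the number of frames per clip, which makes each upload larger.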
Trigger on participant join
Prompt Pegasus once a caller’s camera has buffered enough video.
Notes
- Pegasus requires a minimum resolution of 360×360; lower-resolution frames are scaled up on encode.
- Each request uploads a clip and runs server-side analysis, so latency is higher than single-frame VLMs. Tune `fps` and `clip_seconds` for your use case.
- Uploaded clips are deleted after analysis; asset cleanup is best-effort and does not block the response.
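As an illustration of the minimum-resolution note, the sketch below computes a target size for a frame whose shorter side is under 360 pixels. It is not the plugin's implementation: `upscaled_size` is a hypothetical helper, and preserving the aspect ratio and rounding up to even dimensions (which common H.264 pixel formats require) are assumptions made for this example.

```python
from typing import Tuple

MIN_SIDE = 360  # documented minimum for both width and height


def upscaled_size(width: int, height: int, min_side: int = MIN_SIDE) -> Tuple[int, int]:
    """Hypothetical helper: return a size whose shorter side is at least min_side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    short = min(width, height)
    if short >= min_side:
        return width, height  # already large enough, leave untouched
    # Integer ceiling division keeps the aspect ratio without float rounding issues.
    new_w = -(-width * min_side // short)
    new_h = -(-height * min_side // short)
    # Assumption: round up to even numbers, as many video encoders expect.
    new_w += new_w % 2
    new_h += new_h % 2
    return new_w, new_h


print(upscaled_size(320, 240))   # (480, 360)
print(upscaled_size(1280, 720))  # (1280, 720)
```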
Next Steps
- Build a Voice Agent: Get started with voice
- Build a Video Agent: Add video processing