TwelveLabs provides Pegasus, a video understanding model that analyzes short clips rather than single frames. Use it in real-time video calls to reason about motion and events over time, for example to answer “what just happened?”.
Vision Agents uses Stream Video for real-time WebRTC transport by default. External WebRTC transports are supported as well. Most AI providers offer free tiers to get started.

Installation

uv add "vision-agents[twelvelabs]"
You can get a free API key at twelvelabs.io.

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import twelvelabs, getstream, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="Describe what just happened in the video.",
    llm=twelvelabs.PegasusVLM(),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
Set TWELVELABS_API_KEY in your environment or pass api_key directly.
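The key lookup described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: resolve_api_key is a hypothetical helper, and it assumes an explicitly passed key takes precedence over the environment variable.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else fall back to the env var."""
    key = api_key or os.environ.get("TWELVELABS_API_KEY")
    if not key:
        # Fail early with a clear message instead of at the first request.
        raise RuntimeError(
            "No TwelveLabs API key found: set TWELVELABS_API_KEY or pass api_key."
        )
    return key
```

The result could then be passed as twelvelabs.PegasusVLM(api_key=resolve_api_key()), which surfaces a missing key at startup rather than on the first prompt.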

How it works

Unlike frame-by-frame VLMs, Pegasus buffers recent frames from the watched video track, encodes them into a short MP4 clip, uploads it to the TwelveLabs Assets API, and analyzes it with your prompt. The streamed answer is spoken by your agent’s TTS. Pegasus works well for questions about recent activity: “What did they just do?”, “Did anything fall?”, “Describe the last few seconds.”
Wait a few seconds after a participant joins before prompting, so enough video is buffered for analysis.
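To make the buffering step concrete, here is a rough sketch of a rolling frame buffer. It is an illustration under stated assumptions, not the plugin's implementation: ClipBuffer and its methods are hypothetical names, frames are treated as opaque objects, and the encoding and upload steps are left out.

```python
from collections import deque
from typing import Any, Deque, List, Optional


class ClipBuffer:
    """Hypothetical rolling buffer holding the most recent clip's worth of frames."""

    def __init__(self, fps: float = 1.0, clip_seconds: int = 5) -> None:
        self.interval = 1.0 / fps  # seconds between sampled frames
        self.capacity = max(1, round(fps * clip_seconds))  # frames per clip
        # A bounded deque silently drops the oldest frame once it is full.
        self._frames: Deque[Any] = deque(maxlen=self.capacity)
        self._last_kept: Optional[float] = None

    def add(self, frame: Any, timestamp: float) -> bool:
        """Keep the frame only if a full sampling interval has elapsed."""
        if self._last_kept is not None and timestamp - self._last_kept < self.interval:
            return False
        self._frames.append(frame)
        self._last_kept = timestamp
        return True

    def is_ready(self) -> bool:
        """True once a full clip's worth of frames has been buffered."""
        return len(self._frames) == self.capacity

    def snapshot(self) -> List[Any]:
        """Return the buffered frames, oldest first, ready for encoding."""
        return list(self._frames)
```

With the default fps of 1.0 and clip_seconds of 5, such a buffer only becomes ready after about five seconds of video, which is why prompting immediately after a participant joins gives the model little to analyze.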

Parameters

Name | Type | Default | Description
--- | --- | --- | ---
api_key | str | None | API key (defaults to TWELVELABS_API_KEY env var)
model_name | str | "pegasus1.5" | Pegasus model identifier
fps | float | 1.0 | Frame sampling rate for the buffered clip
clip_seconds | int | 5 | Clip length analyzed per request (minimum 4)
max_tokens | int | 512 | Maximum response tokens (minimum 512)
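The documented defaults and minimums can be checked before the model is constructed. The sketch below is one way to do that. PegasusSettings is a hypothetical container, not part of the plugin, and the requirement that fps be positive is an assumption rather than a documented limit.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class PegasusSettings:
    """Hypothetical container mirroring the documented parameters and defaults."""

    api_key: Optional[str] = None
    model_name: str = "pegasus1.5"
    fps: float = 1.0
    clip_seconds: int = 5
    max_tokens: int = 512

    def __post_init__(self) -> None:
        if self.fps <= 0:
            # Assumption: a non-positive sampling rate is never meaningful.
            raise ValueError("fps must be positive")
        if self.clip_seconds < 4:
            raise ValueError("clip_seconds must be at least 4")
        if self.max_tokens < 512:
            raise ValueError("max_tokens must be at least 512")

    def as_kwargs(self) -> Dict[str, Any]:
        """Return the settings as keyword arguments, keyed by parameter name."""
        return asdict(self)
```

Assuming the constructor accepts the parameter names listed above as keywords, the settings could be applied with twelvelabs.PegasusVLM(**PegasusSettings(clip_seconds=8).as_kwargs()).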

Trigger on participant join

Prompt Pegasus once a caller’s camera has buffered enough video:
import asyncio

from vision_agents.plugins.getstream import CallSessionParticipantJoinedEvent


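@agent.events.subscribe
async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
    # Ignore the agent's own join event; "agent" is the agent_user id from Quick Start.
    if event.participant.user.id != "agent":
        # Wait for a full clip (clip_seconds defaults to 5) to be buffered.
        await asyncio.sleep(5)
        await agent.simple_response("Describe what just happened in the video")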

Notes

  • Pegasus requires a minimum resolution of 360×360; lower-resolution frames are scaled up on encode.
  • Each request uploads a clip and runs server-side analysis, so latency is higher than with single-frame VLMs. Tune fps and clip_seconds for your use case.
  • Uploaded clips are deleted after analysis; asset cleanup is best-effort and does not block the response.
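
The resolution note above can be illustrated with a small size calculation. This is a sketch only: upscaled_size is a hypothetical helper, and it assumes frames are scaled up with their aspect ratio preserved, which the notes do not state.

```python
from typing import Tuple

MIN_SIDE = 360  # minimum width and height stated in the notes above


def upscaled_size(width: int, height: int, min_side: int = MIN_SIDE) -> Tuple[int, int]:
    """Hypothetical helper: smallest size with both sides >= min_side, same aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if min(width, height) >= min_side:
        return width, height  # already large enough, leave untouched
    if width <= height:
        # Width is the shorter side: pin it to min_side, scale height to match.
        return min_side, -(-height * min_side // width)  # ceiling division
    # Height is the shorter side: pin it to min_side, scale width to match.
    return -(-width * min_side // height), min_side
```

Under these assumptions a 320×240 frame would be encoded at 480×360, while a 1280×720 frame would pass through unchanged.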

Next Steps

  • Build a Voice Agent: Get started with voice
  • Build a Video Agent: Add video processing