TwelveLabs

TwelveLabs provides Pegasus, a video understanding model that analyzes short clips rather than single frames. Use it in real-time video calls to reason about motion and events over time, for example to answer “what just happened?”.

Vision Agents uses Stream Video for real-time WebRTC transport by default. External WebRTC transports are supported as well. Most AI providers offer free tiers to get started.

Installation

uv add "vision-agents[twelvelabs]"

You can get a free API key at twelvelabs.io.

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import twelvelabs, getstream, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="Describe what just happened in the video.",
    llm=twelvelabs.PegasusVLM(),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)

Set TWELVELABS_API_KEY in your environment or pass api_key directly.

How it works

Unlike frame-by-frame VLMs, Pegasus buffers recent frames from the watched video track, encodes them into a short MP4 clip, uploads it to the TwelveLabs Assets API, and analyzes it with your prompt. The streamed answer is spoken by your agent’s TTS. Pegasus works well for questions about recent activity: “What did they just do?”, “Did anything fall?”, “Describe the last few seconds.”

Wait a few seconds after a participant joins before prompting, so enough video is buffered for analysis.
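The buffering step can be pictured as a rolling window over the most recent frames. The sketch below is illustrative only and is not the plugin's implementation: `RollingFrameBuffer` and its methods are hypothetical names, and it assumes a clip is ready once `fps * clip_seconds` frames have been collected.

```python
from collections import deque


class RollingFrameBuffer:
    """Hypothetical helper: keeps only the frames needed for one clip."""

    def __init__(self, fps: float = 1.0, clip_seconds: int = 5) -> None:
        # Number of sampled frames that make up one full clip.
        self.capacity = max(1, round(fps * clip_seconds))
        self._frames = deque(maxlen=self.capacity)

    def add(self, frame) -> None:
        # Once the deque is full, appending drops the oldest frame.
        self._frames.append(frame)

    def is_ready(self) -> bool:
        # True when enough frames are buffered for a full clip.
        return len(self._frames) == self.capacity

    def snapshot(self) -> list:
        # Copy of the buffered frames, oldest first.
        return list(self._frames)


buffer = RollingFrameBuffer(fps=1.0, clip_seconds=5)
for i in range(7):
    buffer.add(f"frame-{i}")
```

After seven frames at the default settings, only the last five remain in the window, which is why prompting too early leaves too little video to analyze.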

Parameters

Name	Type	Default	Description
`api_key`	`str`	`None`	API key (defaults to `TWELVELABS_API_KEY` env var)
`model_name`	`str`	`"pegasus1.5"`	Pegasus model identifier
`fps`	`float`	`1.0`	Frame sampling rate for the buffered clip
`clip_seconds`	`int`	`5`	Clip length analyzed per request (minimum `4`)
`max_tokens`	`int`	`512`	Maximum response tokens (minimum `512`)
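To check settings before constructing the model, the documented defaults and minimums can be mirrored in a small validation sketch. `PegasusSettings` is a hypothetical container, not part of the plugin, and the requirement that `fps` be positive is an assumption rather than something stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PegasusSettings:
    """Hypothetical container mirroring the documented parameters."""

    model_name: str = "pegasus1.5"
    fps: float = 1.0
    clip_seconds: int = 5
    max_tokens: int = 512

    def __post_init__(self) -> None:
        # Assumption: a non-positive sampling rate makes no sense.
        if self.fps <= 0:
            raise ValueError("fps must be positive")
        # Minimums taken from the parameter table.
        if self.clip_seconds < 4:
            raise ValueError("clip_seconds must be at least 4")
        if self.max_tokens < 512:
            raise ValueError("max_tokens must be at least 512")

    @property
    def frames_per_clip(self) -> int:
        # Approximate number of frames encoded into each clip.
        return max(1, round(self.fps * self.clip_seconds))
```

With the defaults, `PegasusSettings().frames_per_clip` is 5, so raising `fps` or `clip_seconds` increases the amount of video encoded and uploaded per request.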

Trigger on participant join

Prompt Pegasus once a caller’s camera has buffered enough video:

import asyncio

from vision_agents.plugins.getstream import CallSessionParticipantJoinedEvent


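@agent.events.subscribe
async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
    # Ignore the agent's own join event ("agent" is the id set on agent_user).
    if event.participant.user.id != "agent":
        # Give the buffer time to fill; clip_seconds defaults to 5.
        await asyncio.sleep(5)
        await agent.simple_response("Describe what just happened in the video")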
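If you change `clip_seconds`, the fixed five-second sleep above may no longer match the clip length. One way to keep the two in sync is a small helper, sketched here under assumptions: `prompt_after_warmup` and `margin_seconds` are hypothetical names, and `respond` stands for any coroutine function that takes the prompt text, such as `agent.simple_response`.

```python
import asyncio
from typing import Awaitable, Callable


async def prompt_after_warmup(
    respond: Callable[[str], Awaitable[object]],
    prompt: str,
    clip_seconds: float = 5,
    margin_seconds: float = 1.0,
) -> None:
    """Hypothetical helper: wait for one clip's worth of video, then prompt."""
    # Wait for the clip length plus a small safety margin.
    await asyncio.sleep(clip_seconds + margin_seconds)
    await respond(prompt)
```

Inside the event handler, this could be called as `await prompt_after_warmup(agent.simple_response, "Describe what just happened in the video", clip_seconds=5)`.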

Notes

Pegasus requires a minimum resolution of 360×360; lower-resolution frames are scaled up on encode.
Each request uploads a clip and runs server-side analysis, so latency is higher than with single-frame VLMs. Tune fps and clip_seconds for your use case.
Uploaded clips are deleted after analysis; asset cleanup is best-effort and does not block the response.
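To estimate the dimensions a low-resolution frame would have after upscaling, the sketch below computes the smallest size that meets the 360×360 minimum. It assumes the aspect ratio is preserved and that fractional sizes round up; the plugin's actual scaling behavior may differ, and `upscaled_size` is a hypothetical name.

```python
def upscaled_size(width: int, height: int, min_side: int = 360) -> tuple:
    """Smallest size with both sides >= min_side (aspect ratio kept)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Frames that already meet the minimum are left unchanged.
    if width >= min_side and height >= min_side:
        return width, height
    # Scale so the shorter side equals min_side; integer ceiling
    # division avoids floating-point rounding surprises.
    if width <= height:
        return min_side, -(-height * min_side // width)
    return -(-width * min_side // height), min_side
```

For example, a 320×240 frame maps to 480×360 under these assumptions, while a 640×480 frame is returned unchanged.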

Next Steps

Build a Voice Agent

Build a Video Agent