The Agent class is the central orchestrator that brings together all other components in the Vision Agents framework. It manages the conversation flow, handles real-time audio/video processing, coordinates responses, and integrates with external tools via MCP (Model Context Protocol) servers. It is the main interface for building AI-powered video and voice applications, and it supports both traditional STT/TTS workflows and realtime speech-to-speech models.
Constructor Parameters
- edge (StreamEdge): The edge network provider for video and audio transport (you can choose any provider here)
- llm (LLM | Realtime): The language model, optionally with realtime capabilities
- agent_user (User): The agent’s user information (name, id, etc.)
- instructions (str): System instructions for the agent (default: “Keep your replies short and dont use special characters.”)
- stt (Optional[STT]): Speech-to-text service (not needed for realtime mode)
- tts (Optional[TTS]): Text-to-speech service (not needed for realtime mode)
- turn_detection (Optional[TurnDetector]): Turn detection service for managing conversation turns
- vad (Optional[VAD]): Voice activity detection service
- processors (Optional[List[Processor]]): List of processors for video/audio processing
- mcp_servers (Optional[List[MCPBaseServer]]): MCP servers for external tool access
- options (Optional[AgentOptions]): Configuration options, including the model directory path
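The split between realtime and STT/TTS setups can be illustrated with a small sketch in plain Python. `AgentConfig`, `is_realtime`, and `validate` are hypothetical stand-ins written for this illustration, not the framework's classes. The sketch also assumes that a non-realtime setup needs both `stt` and `tts`, which the "not needed for realtime mode" notes suggest but do not state outright.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentConfig:
    """Hypothetical stand-in mirroring a few Agent constructor parameters."""

    llm: str
    is_realtime: bool = False   # assumption: marks a realtime speech-to-speech model
    stt: Optional[str] = None
    tts: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the combination is usable."""
        problems: List[str] = []
        if not self.is_realtime:
            # Assumed rule: without a realtime model, speech must be
            # transcribed (stt) and synthesized (tts) by separate services.
            if self.stt is None:
                problems.append("stt is required when llm is not realtime")
            if self.tts is None:
                problems.append("tts is required when llm is not realtime")
        return problems


realtime = AgentConfig(llm="realtime-model", is_realtime=True)
classic = AgentConfig(llm="text-model", stt="some-stt", tts="some-tts")
incomplete = AgentConfig(llm="text-model", stt="some-stt")

print(realtime.validate())
print(classic.validate())
print(incomplete.validate())
```

The strings used for `llm`, `stt`, and `tts` are placeholders; in a real agent these would be provider objects.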
Core Lifecycle Methods
async join(call: Call, participant_wait_timeout: Optional[float] = 10.0) -> AsyncIterator[None]
Joins a video call. Must be called as an async context manager.
The agent can join the call only once.
Once the call is ended, the agent closes itself.
Parameters
- call (Call): The call to join.
- participant_wait_timeout (Optional[float]): Timeout in seconds to wait for other participants to join before proceeding. If 0, do not wait at all. If None, wait forever. Default: 10.0.
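The three timeout cases can be sketched with standard `asyncio` primitives. This is only an illustration of the semantics described above, not the framework's code: `wait_for_participant` and the `joined` event are hypothetical names introduced here.

```python
import asyncio
from typing import Optional


async def wait_for_participant(joined: asyncio.Event, timeout: Optional[float] = 10.0) -> bool:
    """Return True if a participant joined within the allowed wait, else False."""
    if timeout == 0:
        # Do not wait at all: report the current state and move on.
        return joined.is_set()
    try:
        # timeout=None makes asyncio.wait_for wait indefinitely.
        await asyncio.wait_for(joined.wait(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # The wait expired; the caller proceeds without a participant.
        return False


async def demo() -> tuple:
    joined = asyncio.Event()
    no_wait = await wait_for_participant(joined, timeout=0)
    timed_out = await wait_for_participant(joined, timeout=0.05)
    # Simulate a participant joining shortly after the unbounded wait starts.
    asyncio.get_running_loop().call_later(0.01, joined.set)
    forever = await wait_for_participant(joined, timeout=None)
    return no_wait, timed_out, forever


result = asyncio.run(demo())
print(result)
```

In this sketch a timeout is not an error: the function simply returns False so the caller can continue, matching the "before proceeding" wording above.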
async finish()
Waits for the call to end gracefully. Subscribes to the call ended event.
async close()
Cleans up all connections and resources. Safe to call multiple times.
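The join-once and close-many-times behavior can be sketched with a stand-in class. `SketchAgent` is hypothetical and not the framework's Agent. Raising `RuntimeError` on a second join and closing when the context exits are assumptions made for this illustration.

```python
import asyncio
from contextlib import asynccontextmanager


class SketchAgent:
    """Hypothetical stand-in; not the framework's Agent class."""

    def __init__(self) -> None:
        self._joined_once = False
        self._closed = False
        self.close_calls = 0  # counts how many times cleanup actually ran

    @asynccontextmanager
    async def join(self, call: str):
        if self._joined_once:
            # Assumption: a second join is rejected with an exception.
            raise RuntimeError("agent can join a call only once")
        self._joined_once = True
        try:
            yield
        finally:
            # Assumption: leaving the context releases resources.
            await self.close()

    async def close(self) -> None:
        if self._closed:
            return  # repeated calls are no-ops
        self._closed = True
        self.close_calls += 1


async def demo() -> tuple:
    agent = SketchAgent()
    async with agent.join("call-1"):
        pass
    await agent.close()  # second close is harmless
    try:
        async with agent.join("call-2"):
            pass
        rejoined = True
    except RuntimeError:
        rejoined = False
    return agent.close_calls, rejoined


result = asyncio.run(demo())
print(result)
```

Guarding cleanup with a flag is one common way to make `close()` idempotent, so callers can invoke it from several shutdown paths without tracking which one ran first.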
async create_user()
Creates the agent user in the edge provider if required.
Response Methods
async simple_response(text: str, participant: Optional[Participant] = None)
Sends a simple text response to the LLM for processing.
MCP Integration
The Agent supports Model Context Protocol (MCP) for external tool integration.
Event System
The Agent lets developers subscribe to events from all components in one place. The event system merges events from the plugins and the core, so you can listen for any of them through a single interface, selecting each event by its type.
Core Events
- Audio Events: AudioReceivedEvent, TrackAddedEvent
- Transcript Events: STTTranscriptEvent, STTPartialTranscriptEvent
- LLM Events: LLMResponseEvent, LLMResponseChunkEvent
- Turn Detection Events: TurnStartedEvent, TurnEndedEvent
- Agent Events: AgentSayEvent, AgentSayStartedEvent, AgentSayCompletedEvent
- Call Events: CallEndedEvent, CallSessionParticipantJoinedEvent
Event Subscription
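Subscribing by event type can be pictured with a small type-keyed dispatcher. The sketch below is a self-contained illustration and not the framework's event system: `SketchEventBus`, its `subscribe` and `send` methods, and the `speaker` field are hypothetical, and the two event classes are stand-ins that only borrow names from the list above.

```python
import asyncio
import inspect
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TurnStartedEvent:  # stand-in carrying only what the demo needs
    speaker: str


@dataclass
class TurnEndedEvent:  # stand-in carrying only what the demo needs
    speaker: str


class SketchEventBus:
    """Hypothetical type-keyed event bus; not the framework's implementation."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, handler):
        # Read the event type from the handler's first parameter annotation.
        first_param = next(iter(inspect.signature(handler).parameters.values()))
        self._handlers[first_param.annotation].append(handler)
        return handler

    async def send(self, event) -> None:
        # Dispatch only to handlers registered for this exact event type.
        for handler in self._handlers[type(event)]:
            await handler(event)


bus = SketchEventBus()
log = []


@bus.subscribe
async def on_turn_started(event: TurnStartedEvent):
    log.append(f"started:{event.speaker}")


@bus.subscribe
async def on_turn_ended(event: TurnEndedEvent):
    log.append(f"ended:{event.speaker}")


async def demo() -> None:
    await bus.send(TurnStartedEvent(speaker="alice"))
    await bus.send(TurnEndedEvent(speaker="alice"))


asyncio.run(demo())
print(log)
```

Keying handlers on the annotated type keeps each subscriber focused on one kind of event, regardless of which component emitted it.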
Debugging with local video files
For testing and debugging video processing without a live camera, you can use a local video file as the video source. This is useful for reproducible testing and development.
Using the CLI
Pass the --video-track-override option when running your agent.

