Agent Class
The Agent class is the central orchestrator that brings together all other components in the Vision Agents framework. It manages the conversation flow, handles real-time audio/video processing, coordinates responses, and integrates with external tools via MCP (Model Context Protocol) servers.
Overview
The Agent class serves as the main interface for building AI-powered video and voice applications. It supports both traditional STT/TTS workflows and modern realtime speech-to-speech models, making it flexible for various use cases.
Basic Usage
Constructor Parameters
Required Parameters
- edge (StreamEdge): The edge network provider for video and audio transport. Any provider can be used here.
- llm (LLM | Realtime): The language model, optionally with realtime capabilities.
- agent_user (User): The agent’s user information (name, id, etc.).
Optional Parameters
- instructions (str): System instructions for the agent (default: “Keep your replies short and dont use special characters.”)
- stt (Optional[STT]): Speech-to-text service (not needed for realtime mode)
- tts (Optional[TTS]): Text-to-speech service (not needed for realtime mode)
- turn_detection (Optional[BaseTurnDetector]): Turn detection service (not needed for realtime mode)
- vad (Optional[VAD]): Voice activity detection service
- processors (Optional[List[Processor]]): List of processors for video/audio processing
- mcp_servers (Optional[List[MCPBaseServer]]): MCP servers for external tool access
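The split between required and optional parameters can be pictured with a small stand-in. This is a minimal sketch, not the framework's Agent: SketchAgentConfig and missing_speech_services are hypothetical names, the component types are replaced by Any, and the check assumes that a traditional (non-realtime) pipeline wants both stt and tts.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Default copied verbatim from the parameter list above.
DEFAULT_INSTRUCTIONS = "Keep your replies short and dont use special characters."


@dataclass
class SketchAgentConfig:
    """Hypothetical stand-in mirroring the documented constructor arguments."""

    # Required parameters
    edge: Any
    llm: Any
    agent_user: Any
    # Optional parameters
    instructions: str = DEFAULT_INSTRUCTIONS
    stt: Optional[Any] = None
    tts: Optional[Any] = None
    turn_detection: Optional[Any] = None
    vad: Optional[Any] = None
    processors: Optional[List[Any]] = None
    mcp_servers: Optional[List[Any]] = None


def missing_speech_services(config: SketchAgentConfig, realtime: bool) -> List[str]:
    """List the speech services a non-realtime pipeline would still need.

    Assumption: a realtime model handles speech itself, so nothing is
    missing in that mode.
    """
    if realtime:
        return []
    return [name for name in ("stt", "tts") if getattr(config, name) is None]


# Placeholder strings stand in for real edge, LLM, and user objects.
config = SketchAgentConfig(edge="edge", llm="llm", agent_user="user")
print(missing_speech_services(config, realtime=False))
print(missing_speech_services(config, realtime=True))
```

A check like this could run before joining a call, so that a misconfigured traditional pipeline fails early rather than staying silent mid-session.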
Key Methods
Core Lifecycle Methods
async join(call: Call) -> AgentSessionContextManager
Joins a video call and returns a context manager for the session.
async finish()
Waits for the call to end gracefully by subscribing to the call-ended event.
async close()
Cleans up all connections and resources. Safe to call multiple times.
async create_user()
Creates the agent user in the edge provider if required.
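The lifecycle methods above suggest a join, wait, clean-up sequence. The sketch below illustrates that ordering with a hypothetical SketchAgent and SketchSession, not the framework's classes. It assumes the object returned by join is a synchronous context manager, and it uses a timer (end_call) in place of a real call-ended event.

```python
import asyncio
from typing import List


class SketchSession:
    """Hypothetical stand-in for AgentSessionContextManager."""

    def __init__(self, agent: "SketchAgent"):
        self.agent = agent

    def __enter__(self) -> "SketchSession":
        self.agent.log.append("session entered")
        return self

    def __exit__(self, *exc_info) -> bool:
        self.agent.log.append("session exited")
        return False  # do not swallow exceptions


class SketchAgent:
    """Hypothetical stand-in exposing the documented lifecycle methods."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self._closed = False
        self._call_ended = asyncio.Event()

    async def join(self, call: str) -> SketchSession:
        self.log.append(f"joined {call}")
        return SketchSession(self)

    async def finish(self) -> None:
        # Block until something signals that the call has ended.
        await self._call_ended.wait()
        self.log.append("call ended")

    async def close(self) -> None:
        # Idempotent: a second call is a no-op.
        if self._closed:
            return
        self._closed = True
        self.log.append("closed")

    def end_call(self) -> None:
        self._call_ended.set()


async def main() -> List[str]:
    agent = SketchAgent()
    try:
        with await agent.join("demo-call"):
            # Simulate the remote side ending the call shortly after joining.
            asyncio.get_running_loop().call_later(0.01, agent.end_call)
            await agent.finish()
    finally:
        await agent.close()
        await agent.close()  # safe to call more than once
    return agent.log


log = asyncio.run(main())
print(log)
```

Putting close in a finally block keeps clean-up on the error path too, which is where the "safe to call multiple times" guarantee is most useful.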
Response Methods
async simple_response(text: str, participant: Optional[Participant] = None)
Sends a simple text response to the LLM for processing.
Event System
subscribe(function)
Subscribes a callback to the agent-wide event bus. The event bus merges events from edge, LLM, STT, TTS, VAD, and other plugins.
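One way to picture a bus that merges events from several components is to forward each component's events into a shared bus. This is a rough sketch under that assumption: SketchEventBus, its merge and send methods, and the two event classes are hypothetical and do not describe the framework's internals.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class SketchTranscriptEvent:
    """Hypothetical event emitted by a speech-to-text component."""
    text: str


@dataclass
class SketchLLMResponseEvent:
    """Hypothetical event emitted by a language model component."""
    text: str


class SketchEventBus:
    """Minimal synchronous bus: every subscriber receives every event."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Any], None]] = []

    def subscribe(self, function: Callable[[Any], None]) -> Callable[[Any], None]:
        self._subscribers.append(function)
        return function  # returning the function allows decorator use

    def send(self, event: Any) -> None:
        for function in list(self._subscribers):
            function(event)

    def merge(self, other: "SketchEventBus") -> None:
        # Forward everything the other bus emits into this one.
        other.subscribe(self.send)


# Each component owns a bus; the agent-wide bus merges them.
stt_bus = SketchEventBus()
llm_bus = SketchEventBus()
agent_bus = SketchEventBus()
agent_bus.merge(stt_bus)
agent_bus.merge(llm_bus)

received: List[Any] = []


@agent_bus.subscribe
def on_event(event: Any) -> None:
    # Filter by type inside the handler.
    if isinstance(event, (SketchTranscriptEvent, SketchLLMResponseEvent)):
        received.append(event)


stt_bus.send(SketchTranscriptEvent(text="hello"))
llm_bus.send(SketchLLMResponseEvent(text="hi there"))
print(received)
```

With this shape, a single subscription on the agent-wide bus observes both components without the handler knowing which plugin produced each event.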
Properties
Mode Detection
- realtime_mode (bool): Returns True if using a Realtime LLM implementation
- publish_audio (bool): Whether the agent should publish an outbound audio track
- publish_video (bool): Whether the agent should publish an outbound video track
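A property like realtime_mode can be derived from the type of the configured model. The sketch below assumes a plain isinstance check. The LLM, Realtime, and SketchAgent classes here are hypothetical stand-ins, and the framework may decide this differently.

```python
from typing import Union


class LLM:
    """Hypothetical stand-in for a text-based language model."""


class Realtime:
    """Hypothetical stand-in for a realtime speech-to-speech model."""


class SketchAgent:
    """Hypothetical stand-in showing only the mode-detection property."""

    def __init__(self, llm: Union[LLM, Realtime]) -> None:
        self.llm = llm

    @property
    def realtime_mode(self) -> bool:
        # True only when the configured model is a Realtime implementation.
        return isinstance(self.llm, Realtime)


print(SketchAgent(Realtime()).realtime_mode)
print(SketchAgent(LLM()).realtime_mode)
```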
Processor Access
- audio_processors (List[Processor]): Processors that can process audio
- video_processors (List[Processor]): Processors that can process video
- image_processors (List[Processor]): Processors that can process images
- video_publishers (List[Processor]): Processors capable of publishing video tracks
- audio_publishers (List[Processor]): Processors capable of publishing audio tracks
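The processor properties above read like filtered views over one processors list. The sketch below shows that idea using duck typing. The method names process_audio and process_video, and all the Sketch classes, are assumptions made for illustration, so the framework's own capability check may look different.

```python
from typing import Any, List


class SketchAudioProcessor:
    def process_audio(self, chunk: bytes) -> None:
        pass


class SketchVideoProcessor:
    def process_video(self, frame: Any) -> None:
        pass


class SketchAudioVideoProcessor:
    def process_audio(self, chunk: bytes) -> None:
        pass

    def process_video(self, frame: Any) -> None:
        pass


def with_capability(processors: List[Any], method_name: str) -> List[Any]:
    """Keep the processors that expose a callable with the given name."""
    return [p for p in processors if callable(getattr(p, method_name, None))]


class SketchAgent:
    """Hypothetical stand-in exposing two of the filtered views."""

    def __init__(self, processors: List[Any]) -> None:
        self.processors = list(processors)

    @property
    def audio_processors(self) -> List[Any]:
        return with_capability(self.processors, "process_audio")

    @property
    def video_processors(self) -> List[Any]:
        return with_capability(self.processors, "process_video")


agent = SketchAgent(
    [SketchAudioProcessor(), SketchVideoProcessor(), SketchAudioVideoProcessor()]
)
print(len(agent.audio_processors), len(agent.video_processors))
```

A processor that implements both methods appears in both views, so one object can serve audio and video without being registered twice.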
MCP Integration
The Agent supports Model Context Protocol (MCP) for external tool integration.
Event System
The Agent provides a comprehensive event system that merges events from all components.
Core Events
- Audio Events: AudioReceivedEvent, VADAudioEvent
- Transcript Events: STTTranscriptEvent, RealtimeTranscriptEvent
- LLM Events: LLMResponseEvent, StandardizedTextDeltaEvent
- Agent Events: AgentSayEvent, AgentSayStartedEvent, AgentSayCompletedEvent
- Call Events: CallEndedEvent, CallSessionParticipantJoinedEvent