The Agent class is the central orchestrator of the Vision Agents framework: it manages conversation flow, handles real-time audio/video processing, coordinates responses, and integrates with external tools via MCP (Model Context Protocol) servers. As the main interface for building AI-powered video and voice applications, it supports both traditional STT/TTS pipelines and modern realtime speech-to-speech models, making it flexible enough for a wide range of use cases.
from vision_agents.core import agents
from vision_agents.plugins import openai, deepgram, elevenlabs, getstream
from vision_agents.core.edge.types import User

# Traditional STT/TTS mode
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.LLM(model="gpt-4o-mini"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
    processors=[yolo_processor]  # optional Processor instances; yolo_processor is defined elsewhere
)

# Realtime mode
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.Realtime(model="gpt-realtime", voice="marin"),
    processors=[yolo_processor]
)

Constructor Parameters

Required Parameters

  • edge (StreamEdge): The edge network provider for video & audio transport (you can choose any provider here)
  • llm (LLM | Realtime): The language model, optionally with realtime capabilities
  • agent_user (User): The agent’s user information (name, id, etc.)

Optional Parameters

  • instructions (str): System instructions for the agent (default: “Keep your replies short and dont use special characters.”)
  • stt (Optional[STT]): Speech-to-text service (not needed for realtime mode)
  • tts (Optional[TTS]): Text-to-speech service (not needed for realtime mode)
  • turn_detection (Optional[TurnDetector]): Turn detection service for managing conversation turns
  • vad (Optional[VAD]): Voice activity detection service
  • processors (Optional[List[Processor]]): List of processors for video/audio processing
  • mcp_servers (Optional[List[MCPBaseServer]]): MCP servers for external tool access
  • options (Optional[AgentOptions]): Configuration options, including the model directory path (see the sketch after this list)

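Several of the optional parameters are not shown in the examples above. The following is a hedged sketch of how they slot in; my_turn_detector, my_vad, and my_options are placeholders for whichever TurnDetector, VAD, and AgentOptions instances you use, not exact plugin APIs:

agent = agents.Agent(
    # ... required parameters as in the examples above
    turn_detection=my_turn_detector,  # placeholder: any TurnDetector implementation
    vad=my_vad,                       # placeholder: any VAD implementation
    options=my_options,               # placeholder: an AgentOptions instance (e.g. model directory)
)
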
Core Lifecycle Methods

async join(call: Call, participant_wait_timeout: Optional[float] = 10.0) -> AsyncIterator[None]
Joins a video call. Must be used as an async context manager. The agent can join a call only once; when the call ends, the agent closes itself.
Parameters
  • call (Call): the call to join.
  • participant_wait_timeout (Optional[float]): timeout in seconds to wait for other participants to join before proceeding. If 0, do not wait at all. If None, wait forever. Default: 10.0.
async with agent.join(call):
    # Agent is now active in the call
    await agent.finish()  # Wait for call to end
async finish()
Waits for the call to end gracefully by subscribing to the call-ended event.

async close()
Cleans up all connections and resources. Safe to call multiple times.

async create_user()
Creates the agent user in the edge provider if required.
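
Putting the lifecycle together, a minimal sketch looks like the following; create_call("default", "my-call-id") is assumed here as the way to obtain a Call object, so adapt it to your edge provider:

async def run(agent: agents.Agent) -> None:
    await agent.create_user()  # create the agent user if the edge provider requires it
    call = agent.create_call("default", "my-call-id")  # assumed helper for obtaining a Call
    try:
        async with agent.join(call):  # join once; the context keeps the session alive
            await agent.finish()      # block until the call ends
    finally:
        await agent.close()           # safe to call even if already closed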

Response Methods

async simple_response(text: str, participant: Optional[Participant] = None)
Sends a simple text response to the LLM for processing.
await agent.simple_response("Hello, how can I help you?")

MCP Integration

The Agent supports Model Context Protocol (MCP) for external tool integration:
import os

from vision_agents.core.mcp import MCPServerRemote

# GitHub personal access token, read here from the environment as an example
github_pat = os.environ["GITHUB_PAT"]

# Create MCP server
github_server = MCPServerRemote(
    url="https://api.githubcopilot.com/mcp/",
    headers={"Authorization": f"Bearer {github_pat}"}
)

# Add to agent
agent = agents.Agent(
    # ... other parameters
    mcp_servers=[github_server]
)
MCP tools are automatically registered with the LLM’s function registry and can be called during conversations. Please check out our MCP guide for more.
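
Once the agent has joined a call, the LLM can invoke those registered tools on its own; a minimal sketch, assuming the github_server configuration above:

async with agent.join(call):
    # The LLM may call the registered GitHub MCP tools to answer this prompt
    await agent.simple_response("Summarize my open GitHub pull requests")
    await agent.finish()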

Event System

The Agent makes it easy to subscribe to events happening across all components. The event system merges events from the core framework and every plugin, so you can listen for any of them in a single place using their respective types.

Core Events

  • Audio Events: AudioReceivedEvent, TrackAddedEvent
  • Transcript Events: STTTranscriptEvent, STTPartialTranscriptEvent
  • LLM Events: LLMResponseEvent, LLMResponseChunkEvent
  • Turn Detection Events: TurnStartedEvent, TurnEndedEvent
  • Agent Events: AgentSayEvent, AgentSayStartedEvent, AgentSayCompletedEvent
  • Call Events: CallEndedEvent, CallSessionParticipantJoinedEvent

Event Subscription

@agent.events.subscribe
async def on_audio_received(event: AudioReceivedEvent):
    # Handle audio data; the type annotation determines which events are delivered
    pass
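
The same pattern works for every event listed above. As a second sketch, a handler for call teardown; the import path is an assumption, so check where your installed version exposes the event types:

from vision_agents.core.events import CallEndedEvent  # assumed import path

@agent.events.subscribe
async def on_call_ended(event: CallEndedEvent):
    print("Call finished, cleaning up")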

Debugging with local video files

For testing and debugging video processing without a live camera, you can use a local video file as the video source. This is useful for reproducible testing and development.

Using the CLI

Pass the --video-track-override option when running your agent:
uv run agent.py --video-track-override=/path/to/video.mp4

Using the API

You can also set the video override programmatically:
agent = Agent(...)
agent.set_video_track_override_path("/path/to/video.mp4")
When a video override is set, the local video file plays in a loop at 30 FPS instead of any incoming video tracks from call participants. The track lifecycle remains intact (starts when a user joins, stops when they leave).
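
Combined with processors, this gives reproducible end-to-end tests; a hedged sketch, where yolo_processor again stands in for any video Processor instance:

agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    llm=openai.LLM(model="gpt-4o-mini"),
    processors=[yolo_processor],  # placeholder Processor instance
)
agent.set_video_track_override_path("/path/to/video.mp4")

async with agent.join(call):
    await agent.finish()  # processors receive frames from the looping file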