The Agent class is the central orchestrator of the Vision Agents framework: it manages conversation flow, handles real-time audio/video processing, coordinates responses, and integrates with external tools via MCP (Model Context Protocol) servers. As the main interface for building AI-powered video and voice applications, it supports both traditional STT/TTS pipelines and modern realtime speech-to-speech models, making it flexible enough for a wide range of use cases.
from vision_agents.core import agents
from vision_agents.plugins import openai, deepgram, elevenlabs, getstream
from vision_agents.core.edge.types import User

# Traditional STT/TTS mode
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.LLM(model="gpt-4o-mini"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
    processors=[yolo_processor]  # optional Processor instances; yolo_processor is defined elsewhere
)

# Realtime mode
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.Realtime(model="gpt-realtime", voice="marin"),
    processors=[yolo_processor]
)

Constructor Parameters

Required Parameters

  • edge (StreamEdge): The edge network provider for video & audio transport (you can choose any provider here)
  • llm (LLM | Realtime): The language model, optionally with realtime capabilities
  • agent_user (User): The agent’s user information (name, id, etc.)

Optional Parameters

  • instructions (str): System instructions for the agent (default: “Keep your replies short and dont use special characters.”)
  • stt (Optional[STT]): Speech-to-text service (not needed for realtime mode)
  • tts (Optional[TTS]): Text-to-speech service (not needed for realtime mode)
  • turn_detection (Optional[TurnDetector]): Turn detection service for managing conversation turns
  • vad (Optional[VAD]): Voice activity detection service
  • processors (Optional[List[Processor]]): List of processors for video/audio processing
  • mcp_servers (Optional[List[MCPBaseServer]]): MCP servers for external tool access
  • options (Optional[AgentOptions]): Configuration options, including the model directory path (see the sketch after this list)

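Several of the optional parameters are not shown in the examples above. The following is a hedged sketch of how they slot in; my_turn_detector, my_vad, and my_options are placeholders for whichever TurnDetector, VAD, and AgentOptions instances you use, not exact plugin APIs:

agent = agents.Agent(
    # ... required parameters as in the examples above
    turn_detection=my_turn_detector,  # placeholder: any TurnDetector implementation
    vad=my_vad,                       # placeholder: any VAD implementation
    options=my_options,               # placeholder: an AgentOptions instance (e.g. model directory)
)
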
Core Lifecycle Methods

async join(call: Call, participant_wait_timeout: Optional[float] = 10.0) -> AsyncIterator[None]
Joins a video call. Must be used as an async context manager. The agent can join a call only once; when the call ends, the agent closes itself.
Parameters
  • call (Call): the call to join.
  • participant_wait_timeout (Optional[float]): timeout in seconds to wait for other participants to join before proceeding. If 0, do not wait at all. If None, wait forever. Default: 10.0.
async with agent.join(call):
    # Agent is now active in the call
    await agent.finish()  # Wait for call to end
async finish()
Waits for the call to end gracefully by subscribing to the call-ended event.

async close()
Cleans up all connections and resources. Safe to call multiple times.

async create_user()
Creates the agent user in the edge provider if required.
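
Putting the lifecycle together, a minimal sketch looks like the following; create_call("default", "my-call-id") is assumed here as the way to obtain a Call object, so adapt it to your edge provider:

async def run(agent: agents.Agent) -> None:
    await agent.create_user()  # create the agent user if the edge provider requires it
    call = agent.create_call("default", "my-call-id")  # assumed helper for obtaining a Call
    try:
        async with agent.join(call):  # join once; the context keeps the session alive
            await agent.finish()      # block until the call ends
    finally:
        await agent.close()           # safe to call even if already closed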

Response Methods

async simple_response(text: str, participant: Optional[Participant] = None)
Sends a simple text response to the LLM for processing.
await agent.simple_response("Hello, how can I help you?")

MCP Integration

The Agent supports Model Context Protocol (MCP) for external tool integration:
import os

from vision_agents.core.mcp import MCPServerRemote

# GitHub personal access token, read here from the environment as an example
github_pat = os.environ["GITHUB_PAT"]

# Create MCP server
github_server = MCPServerRemote(
    url="https://api.githubcopilot.com/mcp/",
    headers={"Authorization": f"Bearer {github_pat}"}
)

# Add to agent
agent = agents.Agent(
    # ... other parameters
    mcp_servers=[github_server]
)
MCP tools are automatically registered with the LLM’s function registry and can be called during conversations. Please check out our MCP guide for more.
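
Once the agent has joined a call, the LLM can invoke those registered tools on its own; a minimal sketch, assuming the github_server configuration above:

async with agent.join(call):
    # The LLM may call the registered GitHub MCP tools to answer this prompt
    await agent.simple_response("Summarize my open GitHub pull requests")
    await agent.finish()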

Event System

The Agent makes it easy to subscribe to events happening across all components. The event system merges events from the core framework and every plugin, so you can listen for any of them in a single place using their respective types.

Core Events

  • Audio Events: AudioReceivedEvent, TrackAddedEvent
  • Transcript Events: STTTranscriptEvent, STTPartialTranscriptEvent
  • LLM Events: LLMResponseEvent, LLMResponseChunkEvent
  • Turn Detection Events: TurnStartedEvent, TurnEndedEvent
  • Agent Events: AgentSayEvent, AgentSayStartedEvent, AgentSayCompletedEvent
  • Call Events: CallEndedEvent, CallSessionParticipantJoinedEvent

Event Subscription

@agent.events.subscribe
async def on_audio_received(event: AudioReceivedEvent):
    # Handle audio data; the type annotation determines which events are delivered
    pass
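
The same pattern works for every event listed above. As a second sketch, a handler for call teardown; the import path is an assumption, so check where your installed version exposes the event types:

from vision_agents.core.events import CallEndedEvent  # assumed import path

@agent.events.subscribe
async def on_call_ended(event: CallEndedEvent):
    print("Call finished, cleaning up")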

Debugging with local video files

For testing and debugging video processing without a live camera, you can use a local video file as the video source. This is useful for reproducible testing and development.

Using the CLI

Pass the --video-track-override option when running your agent:
uv run agent.py --video-track-override=/path/to/video.mp4

Using the API

You can also set the video override programmatically:
agent = Agent(...)
agent.set_video_track_override_path("/path/to/video.mp4")
When a video override is set, the local video file plays in a loop at 30 FPS instead of any incoming video tracks from call participants. The track lifecycle remains intact (starts when a user joins, stops when they leave).
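
Combined with processors, this gives reproducible end-to-end tests; a hedged sketch, where yolo_processor again stands in for any video Processor instance:

agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    llm=openai.LLM(model="gpt-4o-mini"),
    processors=[yolo_processor],  # placeholder Processor instance
)
agent.set_video_track_override_path("/path/to/video.mp4")

async with agent.join(call):
    await agent.finish()  # processors receive frames from the looping file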