Agent Class

The Agent class is the central orchestrator that brings together all other components in the Vision Agents framework. It manages the conversation flow, handles real-time audio/video processing, coordinates responses, and integrates with external tools via MCP (Model Context Protocol) servers.

Overview

The Agent class serves as the main interface for building AI-powered video and voice applications. It supports both traditional STT/TTS workflows and modern realtime speech-to-speech models, making it flexible for various use cases.

Basic Usage

from vision_agents.core import agents
from vision_agents.plugins import openai, deepgram, elevenlabs, getstream, silero
from vision_agents.core.edge.types import User

# Traditional STT/TTS mode
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.LLM(model="gpt-4o-mini"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
    vad=silero.VAD(),
    processors=[yolo_processor]  # assumes a YOLO-based video processor defined elsewhere
)

# Realtime mode 
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.Realtime(model="gpt-realtime", voice="marin"),
    processors=[yolo_processor]  # same YOLO-based processor as above
)

Constructor Parameters

Required Parameters

  • edge (StreamEdge): The edge network provider used for audio/video transport (any supported provider can be used here)
  • llm (LLM | Realtime): The language model, optionally with realtime capabilities
  • agent_user (User): The agent’s user information (name, id, etc.)

Optional Parameters

  • instructions (str): System instructions for the agent (default: “Keep your replies short and dont use special characters.”)
  • stt (Optional[STT]): Speech-to-text service (not needed for realtime mode)
  • tts (Optional[TTS]): Text-to-speech service (not needed for realtime mode)
  • turn_detection (Optional[BaseTurnDetector]): Turn detection service (not needed for realtime mode)
  • vad (Optional[VAD]): Voice activity detection service
  • processors (Optional[List[Processor]]): List of processors for video/audio processing
  • mcp_servers (Optional[List[MCPBaseServer]]): MCP servers for external tool access
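
As a rough sketch, an agent that uses most of the optional parameters might be constructed as below. Here my_turn_detector, yolo_processor, and github_server are placeholders for whichever turn detector, processor, and MCP server you use:
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.LLM(model="gpt-4o-mini"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
    vad=silero.VAD(),
    turn_detection=my_turn_detector,   # any BaseTurnDetector implementation (placeholder)
    processors=[yolo_processor],       # placeholder video processor
    mcp_servers=[github_server],       # see "MCP Integration" below (placeholder)
)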

Key Methods

Core Lifecycle Methods

async join(call: Call) -> AgentSessionContextManager

Joins a video call and returns a context manager for the session.
with await agent.join(call):
    # Agent is now active in the call
    await agent.finish()  # Wait for call to end

async finish()

Waits for the call to end gracefully. Subscribes to the call ended event.

async close()

Cleans up all connections and resources. Safe to call multiple times.

async create_user()

Creates the agent user in the edge provider if required.
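
Taken together, a typical lifecycle looks like the sketch below (inside an async function, assuming agent and call already exist); wrapping it in try/finally ensures close() runs even if the session errors out:
try:
    await agent.create_user()       # ensure the agent user exists on the edge provider
    with await agent.join(call):    # join the call
        await agent.finish()        # block until the call ends
finally:
    await agent.close()             # safe to call multiple times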

Response Methods

async simple_response(text: str, participant: Optional[Participant] = None)

Sends a simple text response to the LLM for processing.
await agent.simple_response("Hello, how can I help you?")

Event System

subscribe(function)

Subscribes a callback to the agent-wide event bus. The event bus merges events from edge, LLM, STT, TTS, VAD, and other plugins.
@agent.subscribe
async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
    await agent.simple_response(f"Hello, {event.participant.user.name}!")

Properties

Mode Detection

  • realtime_mode (bool): Returns True if using a Realtime LLM implementation
  • publish_audio (bool): Whether the agent should publish an outbound audio track
  • publish_video (bool): Whether the agent should publish an outbound video track
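
A minimal sketch of how these flags might be used, for example to log the agent's configuration before joining a call:
if agent.realtime_mode:
    print("Using a speech-to-speech Realtime LLM; STT/TTS are not used")
else:
    print("Using the traditional STT -> LLM -> TTS pipeline")

print(f"Publishes audio: {agent.publish_audio}, publishes video: {agent.publish_video}")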

Processor Access

  • audio_processors (List[Processor]): Processors that can process audio
  • video_processors (List[Processor]): Processors that can process video
  • image_processors (List[Processor]): Processors that can process images
  • video_publishers (List[Processor]): Processors capable of publishing video tracks
  • audio_publishers (List[Processor]): Processors capable of publishing audio tracks
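
For example, a quick sketch that inspects which of the configured processors handle or publish media:
for processor in agent.video_publishers:
    print(f"{type(processor).__name__} will publish a video track")

for processor in agent.audio_processors:
    print(f"{type(processor).__name__} will receive audio")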

MCP Integration

The Agent supports Model Context Protocol (MCP) for external tool integration:
from vision_agents.core.mcp import MCPServerRemote

# Create MCP server
github_server = MCPServerRemote(
    url="https://api.githubcopilot.com/mcp/",
    headers={"Authorization": f"Bearer {github_pat}"}
)

# Add to agent
agent = agents.Agent(
    # ... other parameters
    mcp_servers=[github_server]
)
MCP tools are automatically registered with the LLM’s function registry and can be called during conversations. See our MCP guide for more details.

Event System

The Agent provides a comprehensive event system that merges events from all components:

Core Events

  • Audio Events: AudioReceivedEvent, VADAudioEvent
  • Transcript Events: STTTranscriptEvent, RealtimeTranscriptEvent
  • LLM Events: LLMResponseEvent, StandardizedTextDeltaEvent
  • Agent Events: AgentSayEvent, AgentSayStartedEvent, AgentSayCompletedEvent
  • Call Events: CallEndedEvent, CallSessionParticipantJoinedEvent

Event Subscription

@agent.subscribe
async def on_audio_received(event: AudioReceivedEvent):
    # Handle audio data
    pass

@agent.subscribe
async def on_transcript(event: STTTranscriptEvent):
    # Handle transcript
    await agent.simple_response(event.text)

Configuration Modes

Traditional Mode (STT/TTS)

Uses separate STT and TTS services with an LLM:
agent = agents.Agent(
    edge=getstream.Edge(),
    llm=openai.LLM(model="gpt-4o-mini"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
    vad=silero.VAD(),
    # ... other parameters
)

Realtime Mode (Speech-to-Speech)

Uses a realtime LLM that handles speech directly:
agent = agents.Agent(
    edge=getstream.Edge(),
    llm=openai.Realtime(model="gpt-realtime", voice="marin"),
    # STT, TTS, and VAD not needed
    # ... other parameters
)

Agent Example

import asyncio
from uuid import uuid4
from vision_agents.core import agents, cli
from vision_agents.plugins import openai, deepgram, elevenlabs, getstream
from vision_agents.core.edge.types import User
from vision_agents.core.events import CallSessionParticipantJoinedEvent

async def main():
    # Create agent
    agent = agents.Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI Assistant", id="agent"),
        instructions="You're a helpful voice AI assistant. Keep responses short and conversational.",
        llm=openai.LLM(model="gpt-4o-mini"),
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
        processors=[]
    )
    
    # Set up event handlers
    @agent.subscribe
    async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
        await agent.simple_response(f"Hello, {event.participant.user.name}!")
    
    # Create and join call
    await agent.create_user()
    call = agent.edge.client.video.call("default", str(uuid4()))
    agent.edge.open_demo(call)
    
    with await agent.join(call):
        await agent.finish()
    
    await agent.close()

if __name__ == "__main__":
    asyncio.run(cli.start_dispatcher(main))