Using Vision Agents, developers can build voice agents in one of two modes. The first uses our out-of-the-box support for OpenAI Realtime or Gemini Live, and the second allows for a more traditional STT -> LLM -> TTS pipeline. In this guide, we will show examples of both so developers can choose the option that best fits their use case. We recommend the real-time versions of OpenAI and Gemini for fast, low-latency agents. If you want full control over your voice pipeline, such as using a custom LLM like Grok or Anthropic, consider the second approach. Both approaches follow our philosophy of thin wrapping: if the Agent does not expose something for you directly, the underlying client can either be passed in or accessed directly.

Building with Real-Time OpenAI and Gemini Models

Both OpenAI and Gemini support voice agents directly at the model layer. This means developers are not required to manually pass in text-to-speech, speech-to-text, or voice activity/turn-taking models to the agent; the model has built-in support for these. Let’s build a simple voice agent using the Gemini Live model to get started. For this, we will need to install the following in a new Python 3.12+ project:
uv init

uv add "vision-agents[getstream, gemini]"
Next, in our main.py file, we can start by importing the packages required for our project:
import asyncio
import logging
from uuid import uuid4

from dotenv import load_dotenv

from vision_agents.core.edge.types import User
from vision_agents.plugins import getstream, gemini 
from vision_agents.core import agents, cli

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

load_dotenv()
This sets up some basic logging and loads in the .env variables required for our sample. Since we are running the Gemini model in this example, you will need to have the following in your .env:
# Stream API credentials
STREAM_API_KEY=
STREAM_API_SECRET=

# Gemini
GOOGLE_API_KEY=
Both Stream and Google offer free API keys. For Gemini, developers can get a free API key in Google's AI Studio, while Stream API keys are available from the Stream Dashboard.
Next, we can define our start_agent function, where most of our code will live. In this method, we set up the Agent, pass in basic instructions for the model, configure the edge layer, and specify the user our agent will join the call as:
async def start_agent() -> None:

    llm = gemini.Realtime()
    # create an agent to run with Stream's edge, Gemini llm
    agent = agents.Agent(
        edge=getstream.Edge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc.
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image etc)
        instructions="You're a voice AI assistant. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful.",
        processors=[],  # processors can fetch extra data, check images/audio data or transform video
        # llm with tts & stt. if you use a realtime (sts capable) llm the tts, stt and vad aren't needed
        llm=llm,
    )

    await agent.create_user()
The Agent allows you to interact with the Gemini model in two ways:
  1. Using simple_response, a convenience method for quickly sending some text to the model without changing any additional parameters.
  2. Using send_realtime_input, the native Gemini Realtime Input method, which allows you to interact with the model directly (sketched below).
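Inside the call, the two styles look roughly like this. This is a minimal sketch: it assumes the Realtime wrapper forwards send_realtime_input directly, in line with the thin-wrapping philosophy above; if your version does not, reach for the underlying Gemini client instead.
        # 1) standardized convenience helper
        await agent.llm.simple_response("Chat with the user about the weather.")
        # 2) native Gemini Live input (assumes the wrapper exposes send_realtime_input)
        await agent.llm.send_realtime_input(text="Chat with the user about the weather.")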
Rather than passing in instructions directly in the agent creation step, you can also use @mention syntax in the instructions string, like so:
instructions="Read @voice-agent-instructions.md"
    # Create a call
    call = agent.edge.client.video.call("default", str(uuid4()))

    # Open the demo UI
    agent.edge.open_demo(call)

    # Have the agent join the call/room
    with await agent.join(call):
        # Example 1: standardized simple response
        await agent.llm.simple_response("chat with the user about the weather.")
        # run till the call ends
        await agent.finish()


if __name__ == "__main__":
    asyncio.run(cli.start_dispatcher(start_agent))
To run our example, we can call uv run main.py, which kicks off the agent and automatically opens the Stream Video demo app as the UI 🎉.

Custom voice agent pipelines

For advanced voice pipelines, such as using a different LLM provider, custom voices, VADs, and so on, the Agent framework also allows you to override these components directly. Unlike the previous example, which relies on the OpenAI WebRTC connection or the Gemini Live API, this approach breaks things out into their individual parts and connects them together internally within the Agent class. For example, you could use OpenAI’s GPT-5 as the underlying model but customise the responses by creating a custom voice with Cartesia. In this case, we would make a few small changes to our earlier example. First, in our imports, let’s remove the gemini plugin and replace it with OpenAI. We will also add the cartesia and deepgram packages, since we will be using their TTS and STT services respectively.
from vision_agents.plugins import getstream, openai, cartesia, deepgram  
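If these plugins are not already installed, they can be added the same way as before. The exact extras names below are an assumption based on the getstream and gemini extras used earlier; check the plugin docs if the install fails:
uv add "vision-agents[getstream, openai, cartesia, deepgram]"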
Next, we need to update our .env with the API keys for OpenAI, Cartesia and Deepgram. Each of these services provides developers with the option to create a free API key on their website with generous limits.
# Deepgram API credentials
DEEPGRAM_API_KEY=
# Cartesia API credentials
CARTESIA_API_KEY=
# OpenAI API credentials
OPENAI_API_KEY=
Finally, in our Agent class, we can change the LLM in use and pass in the clients for TTS and STT:
    # create an agent to run with Stream's edge, openAI llm
    agent = agents.Agent(
        edge=getstream.Edge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc.
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image etc)
        instructions="You're a voice AI assistant. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful.",
        llm=openai.LLM(),
        tts=cartesia.TTS(),
        stt=deepgram.STT()
    )
In our example API calls, we can also call OpenAI’s native create_response method directly for more advanced requests:
        # Example 1: standardized simple response
        await agent.llm.simple_response("chat with the user about the weather.")
        # Example 2: use native openAI create response
        await agent.llm.create_response(input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Tell me a short poem about this image"},
                    {"type": "input_image", "image_url": f"https://images.unsplash.com/photo-1757495361144-0c2bfba62b9e?q=80&w=2340&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"},
                ],
            }
        ],)

        # run till the call ends
        await agent.finish()
Running uv run main.py again should bring our agent to life with the familiar Stream demo screen.

Advanced

Both the Realtime and traditional LLM modes support conversation, memory and function calling out of the box. By default, the Agent writes STT and LLM responses to Stream’s real-time Chat API, linked to the Call ID. For function calling and MCP, functions can be annotated with @llm.register_function; they are automatically picked up and transformed into the right format for the LLM:
    @llm.register_function(description="Get current weather for a location")
    def get_weather(location: str):
        """Get the current weather for a location."""
        return {
            "location": location,
            "temperature": "22°C",
            "condition": "Sunny",
            "humidity": "65%"
        }
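Once registered, the function is available to the model on its next turn. As a usage sketch (reusing the simple_response helper shown earlier), asking about the weather should prompt the model to call get_weather:
        # the model can decide to call get_weather to answer this
        await agent.llm.simple_response("What's the weather like in Amsterdam?")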
MCP servers can be passed directly to the Agent class as a list:
    # Create GitHub MCP server
    github_server = MCPServerRemote(
        url="https://api.githubcopilot.com/mcp/",
        headers={"Authorization": f"Bearer {github_pat}"},
        timeout=10.0,  # Shorter connection timeout
        session_timeout=300.0
    )

    agent = Agent(
        edge=edge,
        llm=llm,
        agent_user=agent_user,
        instructions="You are a helpful AI assistant with access to GitHub via MCP server. You can help with GitHub operations like creating issues, managing pull requests, searching repositories, and more. Keep responses conversational and helpful.",
        processors=[],
        mcp_servers=[github_server],
        tts=cartesia.TTS(),
        stt=deepgram.STT(),
        vad=silero.VAD()
    )
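The snippet above references a few values defined elsewhere in your script. Here is a minimal sketch of that setup, reusing the pieces from earlier examples. The GITHUB_PAT environment variable name is an assumption, the silero import simply mirrors the pattern of the other plugins, and the exact MCPServerRemote import path is covered in the MCP guide:
    import os

    from vision_agents.plugins import getstream, openai, cartesia, deepgram, silero

    # GitHub personal access token for the MCP server (assumed env var name)
    github_pat = os.environ["GITHUB_PAT"]

    edge = getstream.Edge()
    llm = openai.LLM()
    agent_user = User(name="My happy AI friend", id="agent")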

For more on these topics, check out our guides on MCP and Function Calling, Chat and Memory, and Processors. Building with Vision Agents? Share it with us; we’re always keen to see (and share) projects from the community.