Voice Agent Starter - Vision Agents

View Simple Agent Example on GitHub

Check out the complete Simple Agent example in our GitHub repository

Build a custom STT → LLM → TTS voice agent with Gemini for reasoning, Deepgram for speech recognition, and ElevenLabs for natural-sounding responses. The agent joins a video call, handles voice conversation, and can observe the camera feed.

Complete the Quickstart first. This example uses a custom pipeline (not Realtime mode) — see Voice Agents for the same pattern with additional providers.

What You Will Build

Listen to user speech and convert it to text with Deepgram STT
Process conversations using Gemini with function calling (weather tool)
Respond with natural-sounding speech via ElevenLabs TTS
Run on Stream’s low-latency edge network

Prerequisites

You need API keys for Stream, Gemini, Deepgram, and ElevenLabs. Free tiers are available from each provider. Set them as the following environment variables:

STREAM_API_KEY=
STREAM_API_SECRET=
GOOGLE_API_KEY=
DEEPGRAM_API_KEY=
ELEVENLABS_API_KEY=
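Before starting the agent, it can help to confirm that all five variables are actually set. The following is a minimal sketch using only the Python standard library. The helper name `missing_keys` is hypothetical and not part of Vision Agents, and the sketch assumes the variables have already been loaded into the process environment (for example by your shell or a dotenv loader).

```python
import os

# Variable names taken from the Prerequisites list above.
REQUIRED_KEYS = [
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "GOOGLE_API_KEY",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
]


def missing_keys(env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper for illustration; pass a dict to check something
    other than the current process environment.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Running this before `uv run` gives a clearer error than a failed provider call partway through a conversation.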

Run the example

Clone and install

Clone the repo and install dependencies from the root:

git clone git@github.com:GetStream/Vision-Agents.git
cd Vision-Agents
uv sync

Configure environment

Create a .env file at the repo root with your API keys (see Prerequisites above).

Run the agent

From the example directory:

cd examples/01_simple_agent_example
uv run simple_agent_example.py run

The CLI opens a browser demo. Join the call and speak to the agent. Ask about the weather to trigger the registered get_weather function.
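To make the idea of a registered function concrete, here is a library-independent sketch of how a tool can be registered by name and then invoked when a model asks for it. This is not the Vision Agents implementation: `register_tool`, `TOOLS`, `dispatch`, and the JSON request shape are all assumptions made for illustration, and `get_weather` returns placeholder data rather than a real forecast.

```python
import json

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {}


def register_tool(fn):
    """Illustrative decorator: store fn under its own name and return it unchanged."""
    TOOLS[fn.__name__] = fn
    return fn


@register_tool
def get_weather(location: str) -> dict:
    # Placeholder result; a real tool would query a weather service here.
    return {"location": location, "forecast": "unknown (stub)"}


def dispatch(tool_call_json: str):
    """Run the tool named in a request shaped like
    {"name": "...", "arguments": {...}} (an assumed format)."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return fn(**call.get("arguments", {}))


result = dispatch('{"name": "get_weather", "arguments": {"location": "Boulder"}}')
```

In the real example, the LLM plugin handles this registration and dispatch for you; the sketch only shows why asking about the weather leads to a Python function being called.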

How it works

The agent uses a custom pipeline instead of a Realtime model:

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="My happy AI friend", id="agent"),
    instructions=INSTRUCTIONS,
    llm=setup_llm(),  # gemini.LLM() with @register_function
    tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
    stt=deepgram.STT(eager_turn_detection=True),
)

Audio flows: user speaks → Deepgram STT transcribes → Gemini LLM generates a response (or calls a tool) → ElevenLabs TTS speaks it back. Deepgram’s eager_turn_detection=True reduces latency by starting LLM inference before the user fully stops speaking.
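The stage ordering described above can be sketched with plain callables. This is a conceptual illustration only, not how Vision Agents wires its plugins: `run_turn` and the `fake_*` stand-ins are hypothetical, and the sketch runs one turn strictly in sequence, so it does not model streaming audio or eager turn detection.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Pass one user utterance through STT, then the LLM, then TTS."""
    transcript = stt(audio)   # speech -> text
    reply = llm(transcript)   # text -> response text
    return tts(reply)         # response text -> speech


# Stand-in stages so the sketch runs without any provider or API key.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


spoken = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

Because each stage is just a callable with a fixed input and output type, any one of them can be replaced without touching the other two, which is the property the Customize section relies on.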

Customize

Swap providers: any STT, LLM, and TTS plugin works — see Integrations.
Use Realtime instead: replace the pipeline with llm=gemini.Realtime() and remove stt and tts — same pattern as the Quickstart.
Add processors: pass items to processors=[] for video analysis — see AI Golf Coach.

Next Steps

AI Golf Coach: add video processing with YOLO pose detection.
Voice Agents: custom pipelines and function calling.
Integrations: swap in any of 35+ supported AI providers.
