STT -> LLM -> TTS pipeline. In this guide, we will show examples using both approaches; developers can choose whichever best fits their needs. We recommend the real-time OpenAI and Gemini models for fast, low-latency agents. If you want full control over your voice pipeline, such as using a different LLM like Grok or Anthropic's Claude, consider the second approach. Both approaches follow our philosophy of thin wrapping: if the `Agent` does not expose something directly, the underlying client can either be passed in or accessed directly.
Building with Real-Time OpenAI and Gemini Models
Both OpenAI and Gemini support voice agents directly at the model layer. This means developers are not required to manually pass in text-to-speech, speech-to-text, or voice activity/turn-taking models to the agent; the model has built-in support for these. Let’s build a simple voice agent using the Gemini Live model to get started. For this, we will need to install the following in a new Python 3.12+ project:
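The exact package and extra names depend on the SDK release you are using; as a rough sketch (the package names here are assumptions, so check the framework's documentation), the setup with uv might look like:

```bash
# Package and extra names are assumptions -- confirm them against the docs.
uv init my-voice-agent && cd my-voice-agent
uv add "vision-agents[getstream,gemini]" python-dotenv
```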
Next, in our `main.py` file, we can start by importing the packages required for our project:
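As a sketch, the imports could look like the following (the module paths are assumptions based on the framework's plugin layout and may differ in your SDK version):

```python
# main.py -- import paths are assumptions; adjust them to match your SDK.
import asyncio
from dotenv import load_dotenv

from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini

load_dotenv()  # load the API keys from our .env file
```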
Next, let’s create a `.env` file with the environment variables required for our sample. Since we are running the Gemini model in this example, you will need the following in your `.env`:
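A minimal `.env` for this example could look like this (the variable names are assumptions; use whichever names your SDK reads):

```
# Variable names are assumptions -- match them to what your SDK expects.
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
GOOGLE_API_KEY=your_gemini_api_key
```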
Both Stream and Google offer free API keys. For Gemini, developers can get a free API key on Google’s AI Studio, while Stream developers can get theirs from the Stream Dashboard.
Next, let’s write the `start_agent` function where most of our code will live. In this function, we can set up the `Agent`, pass in basic instructions for the model, and configure the edge layer and the user our agent will join the call as:
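A rough sketch of `start_agent` might look like this (helpers such as `getstream.Edge`, `User`, `create_call`, and `join` are assumptions, so treat it as illustrative rather than the framework's exact API):

```python
# Sketch only -- the call-management helpers vary by SDK version.
async def start_agent() -> None:
    agent = Agent(
        edge=getstream.Edge(),                # Stream provides the edge/WebRTC layer
        agent_user=User(name="Friendly AI"),  # the user our agent joins the call as
        instructions="You're a friendly voice assistant. Keep replies short.",
        llm=gemini.Realtime(),                # realtime model: built-in STT/TTS/turn-taking
    )

    # Create a call and keep the agent in it until the session ends.
    call = await agent.create_call()
    with await agent.join(call):
        await agent.finish()


if __name__ == "__main__":
    asyncio.run(start_agent())
```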
The `Agent` allows you to interact with the Gemini model in two ways:
- Using `simple_response`, a convenience method for quickly sending some text to the model without changing any additional parameters.
- Using `send_realtime_input`, the native Gemini realtime input method, which allows you to interact with the model directly (see the example after this list).
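For example, assuming the LLM is reachable via `agent.llm` (the exact argument shapes are assumptions):

```python
# Convenience helper: just send some text to the model.
await agent.llm.simple_response("Say hello to everyone who just joined the call.")

# Native Gemini realtime input -- the accepted arguments follow Google's
# Live API types, so adjust to your installed google-genai version.
await agent.llm.send_realtime_input(text="What's the weather like today?")
```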
Rather than passing in instructions directly in the agent creation step, you can also use @mention syntax in the instructions string, like so:
Finally, we can run the project with `uv run main.py`, which kicks off the agent and automatically opens the Stream Video demo app as the UI 🎉.
Custom Voice Agent Pipelines
For advanced voice pipelines, such as using a different LLM provider, custom voices, VADs, and so on, the Agent framework also allows you to override these properties directly. Unlike the previous example, which used the OpenAI WebRTC connection and the Gemini Live API, this approach breaks things out into their individual parts and connects them together internally within the `Agent` class.
For example, you could use OpenAI’s GPT-5 as the underlying model but customise the responses by creating a custom voice with Cartesia. In this case, we would make a few small changes to our earlier example.
First, in our imports, let’s remove the `gemini` plugin and replace it with the `openai` one. We will also add the `cartesia` and `deepgram` packages, since we will be using their TTS and STT services respectively.
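The updated imports might then look like this (module paths are, again, assumptions):

```python
# Swap the Gemini plugin for OpenAI, and add Cartesia (TTS) and Deepgram (STT).
# Import paths are assumptions -- adjust them to match your SDK.
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, openai, cartesia, deepgram
```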
Next, update the `.env` file with the API keys for OpenAI, Cartesia, and Deepgram. Each of these services gives developers the option to create a free API key on their website with generous limits.
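Our `.env` might then contain something like this (variable names are assumptions):

```
# Variable names are assumptions -- match them to what your SDK expects.
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
OPENAI_API_KEY=your_openai_api_key
CARTESIA_API_KEY=your_cartesia_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
```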
Next, in our `Agent` class, we can change the LLM in use and pass in the clients for TTS and STT:
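As a sketch (the plugin class names and constructor arguments such as `tts`, `stt`, and `voice_id` are assumptions):

```python
# Custom pipeline sketch -- plugin class names and arguments are assumptions.
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Friendly AI"),
    instructions="You're a friendly voice assistant. Keep replies short.",
    llm=openai.LLM(model="gpt-5"),                        # standard (non-realtime) LLM
    tts=cartesia.TTS(voice_id="your-cartesia-voice-id"),  # custom Cartesia voice
    stt=deepgram.STT(),                                   # Deepgram speech-to-text
)
```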
You can also call the LLM’s `create_response` method directly for advanced requests:
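For example, a minimal sketch (the parameter names mirror OpenAI-style request options and are assumptions; the exact signature depends on the provider plugin):

```python
# Sketch only -- parameters are assumptions; simple_response remains the easy path.
response = await agent.llm.create_response(
    input="Summarise the last few minutes of this call in two sentences.",
    instructions="Answer as briefly as possible.",
)
```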
Running `uv run main.py` should once again bring our agent to life with the familiar Stream demo screen.
Advanced
Both the `Realtime` and traditional `LLM` modes support things like conversation, memory, and function calling out of the box. By default, the Agent will write STT and LLM responses to Stream’s real-time Chat API, which is linked to the call ID. For function calling and MCP, functions can be annotated with `@llm.register_function`. They are automatically picked up and transformed into the right format for the LLM:
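A small sketch of what that can look like (the decorator name comes from above; whether docstrings and type hints are used to build the tool schema is an assumption):

```python
# Create the LLM first so its register_function decorator is available,
# then pass it into the Agent as before.
llm = openai.LLM(model="gpt-5")

@llm.register_function
def get_weather(city: str) -> dict:
    """Return the current weather for a city (stubbed for this example)."""
    # A real agent would call a weather API here.
    return {"city": city, "temperature_c": 21, "conditions": "partly cloudy"}
```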