AssemblyAI provides real-time streaming speech-to-text with built-in punctuation-based turn detection and sub-300ms latency.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Installation

uv add "vision-agents[assemblyai]"

Quick start

from vision_agents.core import Agent, User
from vision_agents.plugins import assemblyai, cartesia, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM(),
    stt=assemblyai.STT(),
    tts=cartesia.TTS(),
)
Set ASSEMBLYAI_API_KEY in your environment or pass api_key directly.
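
The lookup order described above (an explicit key first, then the environment variable) can be sketched with the standard library alone. `resolve_api_key` is a hypothetical helper written for illustration; it is not part of Vision Agents, and the plugin's own resolution logic may differ in detail.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else fall back to ASSEMBLYAI_API_KEY.

    Hypothetical helper for illustration; not part of vision_agents.
    """
    key = api_key or os.environ.get("ASSEMBLYAI_API_KEY")
    if not key:
        # Fail early with a clear message instead of at connection time.
        raise ValueError("Set ASSEMBLYAI_API_KEY or pass api_key explicitly.")
    return key
```

The result could then be passed as `assemblyai.STT(api_key=resolve_api_key())`, which surfaces a missing key before the agent starts.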

STT

Real-time transcription using AssemblyAI’s Universal-3 Pro model with built-in turn detection.
stt = assemblyai.STT(
    speech_model="u3-rt-pro",
    sample_rate=16000,
)

With keyterms boosting

Boost recognition accuracy for specific terms:
stt = assemblyai.STT(
    keyterms_prompt=["AssemblyAI", "Vision Agents"],
)
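
The Parameters table notes that keyterms_prompt cannot be combined with prompt. A minimal sketch of guarding against that before constructing the STT follows; `check_prompt_options` is a hypothetical helper, and how the plugin itself reports the conflict is not specified here.

```python
from typing import Optional, Sequence


def check_prompt_options(
    prompt: Optional[str] = None,
    keyterms_prompt: Optional[Sequence[str]] = None,
) -> dict:
    """Build STT keyword arguments, rejecting prompt and keyterms_prompt together.

    Hypothetical helper for illustration; not part of vision_agents.
    """
    if prompt is not None and keyterms_prompt is not None:
        raise ValueError("Pass either prompt or keyterms_prompt, not both.")
    kwargs = {}
    if prompt is not None:
        kwargs["prompt"] = prompt
    if keyterms_prompt is not None:
        kwargs["keyterms_prompt"] = list(keyterms_prompt)
    return kwargs
```

The returned dictionary could be unpacked into the constructor, for example `assemblyai.STT(**check_prompt_options(keyterms_prompt=["AssemblyAI"]))`.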

With custom turn silence thresholds

Configure turn detection timing:
stt = assemblyai.STT(
    min_turn_silence=100,   # ms before speculative EOT check
    max_turn_silence=1200,  # ms before forcing turn end
)
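
As a rough mental model of the two thresholds, the sketch below classifies a silence duration into three phases. It is a simplification under stated assumptions: the function name, the phase labels, and the inclusive boundaries are all hypothetical, and the actual end-of-turn decision is made by the AssemblyAI service, not by this logic.

```python
def turn_phase(
    silence_ms: int,
    min_turn_silence: int = 100,
    max_turn_silence: int = 1200,
) -> str:
    """Classify trailing silence relative to the two thresholds (illustrative only)."""
    if silence_ms >= max_turn_silence:
        # Silence has lasted long enough that the turn is ended regardless.
        return "forced_end"
    if silence_ms >= min_turn_silence:
        # Enough silence to run a speculative end-of-turn check.
        return "speculative_check"
    # Too little silence so far; keep listening.
    return "listening"
```

Lowering min_turn_silence makes the speculative check start sooner, while raising max_turn_silence gives speakers more room to pause before the turn is closed.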

Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | None | API key (defaults to ASSEMBLYAI_API_KEY env var) |
| speech_model | str | "u3-rt-pro" | Model identifier |
| sample_rate | int | 16000 | Audio sample rate in Hz |
| min_turn_silence | int | API default | Silence (ms) before speculative end-of-turn check |
| max_turn_silence | int | API default | Maximum silence (ms) before forcing turn end |
| prompt | str | None | Custom transcription prompt (cannot be combined with keyterms_prompt) |
| keyterms_prompt | list[str] | None | List of terms to boost recognition for (cannot be combined with prompt) |
| max_reconnect_attempts | int | 3 | Maximum reconnect attempts on transient failures |
| reconnect_backoff_initial_s | float | 0.5 | Initial backoff delay in seconds |
| reconnect_backoff_max_s | float | 4.0 | Maximum backoff delay in seconds |
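
To get a feel for how the three reconnect parameters might interact, the sketch below computes a capped exponential backoff schedule. The doubling factor is an assumption made for illustration; the plugin's actual growth rate (and whether it adds jitter) is not stated here, and `backoff_delays` is a hypothetical helper, not a Vision Agents function.

```python
from typing import List


def backoff_delays(
    max_reconnect_attempts: int = 3,
    reconnect_backoff_initial_s: float = 0.5,
    reconnect_backoff_max_s: float = 4.0,
    factor: float = 2.0,  # assumed growth factor, not confirmed by the docs
) -> List[float]:
    """Return the wait (in seconds) before each reconnect attempt."""
    delays = []
    delay = reconnect_backoff_initial_s
    for _ in range(max_reconnect_attempts):
        # Never wait longer than the configured maximum.
        delays.append(min(delay, reconnect_backoff_max_s))
        delay *= factor
    return delays
```

Under these assumptions the defaults give waits of 0.5, 1.0, and 2.0 seconds, so the 4.0 second cap only comes into play once max_reconnect_attempts is raised above 3.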

Next steps