Cartesia provides low-latency speech-to-text with the Ink model. The STT plugin streams PCM audio to Cartesia Ink and emits transcript and turn events, which Vision Agents uses for interruption and eager turn handling.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Cartesia also provides low-latency text-to-speech. You can use both in the same agent.

Installation

uv add "vision-agents[cartesia]"

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import cartesia, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-3-flash-preview"),
    stt=cartesia.STT(),
    tts=cartesia.TTS(),
)
Set CARTESIA_API_KEY in your environment or pass api_key directly.
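
The lookup order described above (explicit argument first, then the environment variable) can be sketched in a few lines. This is only an illustration for checking configuration before the agent starts; `resolve_cartesia_key` is a hypothetical helper, not part of the plugin, and the plugin's own lookup may differ in detail.

```python
import os
from typing import Mapping, Optional


def resolve_cartesia_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: prefer an explicit key, else read CARTESIA_API_KEY."""
    key = explicit or env.get("CARTESIA_API_KEY")
    if not key:
        # Fail early with a clear message instead of at the first audio frame.
        raise RuntimeError("Set CARTESIA_API_KEY or pass api_key explicitly")
    return key
```

The result could then be passed as `cartesia.STT(api_key=resolve_cartesia_key())`, which surfaces a missing key at startup.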

Parameters

Name              Type  Default       Description
model             str   "ink-2"       Cartesia STT model
sample_rate       int   16000         PCM sample rate (Hz) sent to Cartesia
encoding          str   "pcm_s16le"   PCM encoding sent to Cartesia
cartesia_version  str   "2026-03-01"  Cartesia API version used for the turn-detection websocket
api_key           str   None          API key (defaults to CARTESIA_API_KEY env var)
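
To make the default `sample_rate` and `encoding` concrete, the sketch below converts float samples to `pcm_s16le` (signed 16-bit, little-endian) and computes how many bytes a chunk of audio occupies at 16000 Hz. It uses only the standard library. Mono audio and the 20 ms chunk duration are assumptions chosen for illustration, and `float_to_pcm_s16le` and `chunk_size_bytes` are hypothetical helpers, not plugin functions.

```python
import struct

SAMPLE_RATE = 16000    # default sample_rate from the parameters table
BYTES_PER_SAMPLE = 2   # pcm_s16le stores each sample as a signed 16-bit integer


def float_to_pcm_s16le(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    # Clamp first so out-of-range input cannot overflow the 16-bit range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_size_bytes(duration_ms, sample_rate=SAMPLE_RATE):
    """Bytes of mono pcm_s16le audio covering duration_ms milliseconds."""
    return sample_rate * duration_ms // 1000 * BYTES_PER_SAMPLE


# Assumed 20 ms chunk: 320 samples at 16 kHz, 2 bytes each.
print(chunk_size_bytes(20))
```

Audio captured at a different rate or bit depth would need resampling or conversion to match whatever `sample_rate` and `encoding` are configured.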

Next Steps

Cartesia TTS: Low-latency text-to-speech

Build a Voice Agent: Get started with voice