Cartesia - Vision Agents

Cartesia provides low-latency text-to-speech with the Sonic model. It is designed for real-time voice applications that need natural-sounding speech synthesis.

Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Cartesia also provides low-latency speech-to-text. You can use both in the same agent.

Installation

uv add "vision-agents[cartesia]"

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import cartesia, gemini, deepgram, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-3-flash-preview"),
    stt=deepgram.STT(),
    tts=cartesia.TTS(),
)

Set CARTESIA_API_KEY in your environment, or pass api_key directly to cartesia.TTS().
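The two ways of supplying the key can be combined in a small helper. This is a minimal sketch, not part of the plugin: `resolve_cartesia_api_key` is a hypothetical name introduced here for illustration, and the only things taken from this page are the `CARTESIA_API_KEY` variable and the `api_key` parameter.

```python
import os


def resolve_cartesia_api_key(explicit_key=None):
    """Return the explicit key if given, else CARTESIA_API_KEY from the environment.

    Hypothetical helper for illustration; it is not provided by vision-agents.
    """
    key = explicit_key or os.environ.get("CARTESIA_API_KEY")
    if not key:
        # Fail early with a clear message instead of at the first TTS request.
        raise RuntimeError("Set CARTESIA_API_KEY or pass api_key explicitly.")
    return key


# Usage with the plugin from the Quick Start (not executed here):
#   tts = cartesia.TTS(api_key=resolve_cartesia_api_key())
```

Checking for the key at startup can make a missing credential easier to diagnose than an authentication error that surfaces mid-call.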

Parameters

Name	Type	Default	Description
`model_id`	`str`	`"sonic-3.5"`	Cartesia TTS model
`voice_id`	`str`	`"6ccbfb76-1fc6-48f7-b71d-91ac6298247b"`	Voice ID
`sample_rate`	`int`	`16000`	Audio sample rate in Hz
`api_key`	`str`	`None`	API key (defaults to `CARTESIA_API_KEY` env var)
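To override a single parameter while keeping the others at their documented defaults, one option is to build the keyword arguments explicitly. The sketch below copies the defaults from the table above; `CARTESIA_TTS_DEFAULTS` and `cartesia_tts_kwargs` are hypothetical names, and the `24000` sample rate is an illustrative value only, so check which rates Cartesia and your transport actually support.

```python
# Defaults copied from the Parameters table on this page.
CARTESIA_TTS_DEFAULTS = {
    "model_id": "sonic-3.5",
    "voice_id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b",
    "sample_rate": 16000,
    "api_key": None,  # None means: fall back to the CARTESIA_API_KEY env var
}


def cartesia_tts_kwargs(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names.

    Hypothetical helper for illustration; it is not provided by vision-agents.
    """
    unknown = set(overrides) - set(CARTESIA_TTS_DEFAULTS)
    if unknown:
        raise TypeError(f"Unknown Cartesia TTS parameter(s): {sorted(unknown)}")
    return {**CARTESIA_TTS_DEFAULTS, **overrides}


# Illustrative override: a higher sample rate (assumed value, not from this page).
kwargs = cartesia_tts_kwargs(sample_rate=24000)

# Usage with the plugin from the Quick Start (not executed here):
#   tts = cartesia.TTS(**kwargs)
```

Rejecting unknown names catches typos such as `voice` instead of `voice_id` before the arguments reach the plugin.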

Next Steps

Build a Voice Agent: Get started with voice

Build a Video Agent: Add video processing
