Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Installation
Quick start
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| api_key | str | None | API key (defaults to XAI_API_KEY env var) |
| voice | str | "eve" | Voice ("eve", "ara", "leo", "rex", "sal") |
| language | str | "en" | BCP-47 language code (e.g. "en", "zh", "pt-BR") or "auto" |
| codec | str | "pcm" | Output codec ("pcm", "wav", "mp3", "mulaw", "alaw") |
| sample_rate | int | 24000 | Output sample rate in Hz (8000, 16000, 22050, 24000, 44100, or 48000) |
| bit_rate | int | None | MP3 bit rate (only used when codec is "mp3") |
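
The constraints in the table above can be sketched as a small validation helper. This is an illustrative sketch only: `TTSOptions` is a hypothetical name, not the plugin's class, and rejecting a `bit_rate` for non-MP3 codecs is an assumption (the table only says the value is unused in that case).

```python
import os
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the parameter table.
VOICES = {"eve", "ara", "leo", "rex", "sal"}
CODECS = {"pcm", "wav", "mp3", "mulaw", "alaw"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}


@dataclass
class TTSOptions:
    """Hypothetical container mirroring the table; not part of the plugin."""

    api_key: Optional[str] = None
    voice: str = "eve"
    language: str = "en"
    codec: str = "pcm"
    sample_rate: int = 24000
    bit_rate: Optional[int] = None

    def __post_init__(self) -> None:
        # Fall back to the XAI_API_KEY environment variable, as the table describes.
        if self.api_key is None:
            self.api_key = os.environ.get("XAI_API_KEY")
        if self.voice not in VOICES:
            raise ValueError(f"unknown voice: {self.voice!r}")
        if self.codec not in CODECS:
            raise ValueError(f"unknown codec: {self.codec!r}")
        if self.sample_rate not in SAMPLE_RATES:
            raise ValueError(f"unsupported sample rate: {self.sample_rate}")
        # Assumption: treat a bit rate on a non-MP3 codec as a mistake
        # rather than silently ignoring it.
        if self.bit_rate is not None and self.codec != "mp3":
            raise ValueError("bit_rate only applies when codec is 'mp3'")
```

Checking the values up front surfaces a mistyped voice or an unsupported sample rate before any audio request is made.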
Voices
| Voice | Description |
|---|---|
| eve | Energetic, upbeat: engaging and enthusiastic (default) |
| ara | Warm, friendly: balanced and conversational |
| leo | Authoritative, strong: commanding, great for instructional content |
| rex | Confident, clear: professional, ideal for business |
| sal | Smooth, balanced: versatile for a wide range of contexts |
Speech tags
You can use speech tags in your text for fine-grained delivery control.

Inline tags: `[pause]` `[long-pause]` `[laugh]` `[chuckle]` `[giggle]` `[cry]` `[tsk]` `[tongue-click]` `[lip-smack]` `[breath]` `[inhale]` `[exhale]` `[sigh]` `[hum-tune]`

Wrapping tags: `<whisper>`, `<shout>`, `<slow>`, `<fast>`, `<soft>`, `<loud>`, `<high-pitch>`, `<low-pitch>`, `<sing>`
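
As a rough illustration of how tagged text might be assembled, the sketch below builds strings from the two tag lists. The helpers `inline` and `wrap` are hypothetical and not part of the plugin, and the XML-style closing form (`</whisper>`) is an assumption, since only the opening forms are listed here.

```python
# Tag names copied from the lists above.
INLINE_TAGS = {
    "pause", "long-pause", "laugh", "chuckle", "giggle", "cry", "tsk",
    "tongue-click", "lip-smack", "breath", "inhale", "exhale", "sigh",
    "hum-tune",
}
WRAPPING_TAGS = {
    "whisper", "shout", "slow", "fast", "soft", "loud",
    "high-pitch", "low-pitch", "sing",
}


def inline(tag: str) -> str:
    """Return an inline tag such as "[pause]", rejecting unknown names."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag!r}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a tag pair; the closing-tag syntax is assumed."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


# Build a sentence that pauses, then whispers the second half.
text = f"Let me think. {inline('pause')} {wrap('whisper', 'It is a secret.')}"
```

Validating tag names before sending text helps catch typos such as `[puase]`, which would otherwise pass through as ordinary text.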
Next steps
xAI LLM: Advanced reasoning with Grok
xAI Realtime: Speech-to-speech over WebSocket

