xAI provides text-to-speech with five expressive voices, inline speech tags for delivery control, and multiple output codecs.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
xAI also provides an LLM and Realtime speech-to-speech. You can use all three in the same agent.

Installation

uv add "vision-agents[xai]"

Quick start

from vision_agents.core import Agent, User
from vision_agents.plugins import xai, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=xai.LLM(model="grok-4.1"),
    stt=deepgram.STT(),
    tts=xai.TTS(),
)
Set XAI_API_KEY in your environment or pass api_key directly.
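
If you want a missing key to fail early in your own code rather than at the first synthesis request, the lookup order described above can be sketched as a small helper. The function name `resolve_xai_api_key` and its `env` parameter are hypothetical and not part of the plugin; only the `XAI_API_KEY` variable and the `api_key` argument come from this page.

```python
import os


def resolve_xai_api_key(explicit_key=None, env=None):
    """Return the explicit key if given, otherwise XAI_API_KEY from the environment.

    Hypothetical helper for illustration; the plugin performs its own lookup.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("XAI_API_KEY")
    if not key:
        # Fail before any agent is constructed, with an actionable message.
        raise RuntimeError("Set XAI_API_KEY or pass api_key explicitly")
    return key
```

The returned value could then be passed along as `xai.TTS(api_key=resolve_xai_api_key())`.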

Parameters

tts = xai.TTS(voice="eve", language="en", codec="pcm", sample_rate=24000)
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | None | API key (defaults to XAI_API_KEY env var) |
| voice | str | "eve" | Voice ("eve", "ara", "leo", "rex", "sal") |
| language | str | "en" | BCP-47 language code (e.g. "en", "zh", "pt-BR") or "auto" |
| codec | str | "pcm" | Output codec ("pcm", "wav", "mp3", "mulaw", "alaw") |
| sample_rate | int | 24000 | Output sample rate in Hz (8000, 16000, 22050, 24000, 44100, or 48000) |
| bit_rate | int | None | MP3 bit rate (only used when codec is "mp3") |
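
The allowed values listed above can be checked locally before a `xai.TTS(...)` instance is constructed. The following is a minimal sketch: `check_tts_options` is a hypothetical helper, not part of the plugin, and the plugin may validate these options differently or not at all.

```python
# Allowed values copied from the parameter list above.
VOICES = {"eve", "ara", "leo", "rex", "sal"}
CODECS = {"pcm", "wav", "mp3", "mulaw", "alaw"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}


def check_tts_options(voice="eve", codec="pcm", sample_rate=24000, bit_rate=None):
    """Return a list of human-readable problems; an empty list means the options look valid."""
    problems = []
    if voice not in VOICES:
        problems.append(f"unknown voice: {voice!r}")
    if codec not in CODECS:
        problems.append(f"unknown codec: {codec!r}")
    if sample_rate not in SAMPLE_RATES:
        problems.append(f"unsupported sample rate: {sample_rate}")
    if bit_rate is not None and codec != "mp3":
        # bit_rate is documented as MP3-only, so flag it for other codecs.
        problems.append("bit_rate is only used when codec is 'mp3'")
    return problems
```

For example, `check_tts_options(codec="pcm", bit_rate=128000)` reports one problem, because a bit rate has no effect on PCM output.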

Voices

| Voice | Description |
| --- | --- |
| eve | Energetic, upbeat; engaging and enthusiastic (default) |
| ara | Warm, friendly; balanced and conversational |
| leo | Authoritative, strong; commanding, great for instructional content |
| rex | Confident, clear; professional, ideal for business |
| sal | Smooth, balanced; versatile for a wide range of contexts |

Speech tags

You can use inline speech tags in your text for fine-grained delivery control.

Inline tags: [pause] [long-pause] [laugh] [chuckle] [giggle] [cry] [tsk] [tongue-click] [lip-smack] [breath] [inhale] [exhale] [sigh] [hum-tune]

Wrapping tags: <whisper>, <shout>, <slow>, <fast>, <soft>, <loud>, <high-pitch>, <low-pitch>, <sing>
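
Building tagged text by hand makes it easy to mistype a tag name. The sketch below shows one way to compose such strings with a check against the tag names listed above. The helpers `inline` and `wrap` are hypothetical, and the closing-tag form `</whisper>` is an assumption: this page lists only the opening form of each wrapping tag.

```python
# Tag names copied from the lists above.
INLINE_TAGS = {
    "pause", "long-pause", "laugh", "chuckle", "giggle", "cry", "tsk",
    "tongue-click", "lip-smack", "breath", "inhale", "exhale", "sigh", "hum-tune",
}
WRAPPING_TAGS = {
    "whisper", "shout", "slow", "fast", "soft", "loud",
    "high-pitch", "low-pitch", "sing",
}


def inline(tag):
    """Return an inline tag such as "[pause]", rejecting unknown names."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag!r}")
    return f"[{tag}]"


def wrap(tag, text):
    """Wrap text in a delivery tag; the closing-tag syntax is assumed, not documented here."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


line = f"Let me think. {inline('pause')} {wrap('whisper', 'It is a secret.')}"
```

The resulting `line` is an ordinary string, so it can be used wherever the agent's spoken text is produced.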

Next steps

xAI LLM: Advanced reasoning with Grok

xAI Realtime: Speech-to-speech over WebSocket