xAI provides realtime speech-to-speech over WebSocket with server-side voice activity detection (VAD), built-in web search, and X search, so no separate STT or TTS service is needed.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
xAI also provides a traditional LLM and standalone text-to-speech.

Installation

uv add "vision-agents[xai]"

Quick start

from vision_agents.core import Agent, User
from vision_agents.plugins import xai, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    llm=xai.Realtime(),
)
Set XAI_API_KEY in your environment or pass api_key directly.

Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | "grok-4-1-fast-non-reasoning" | Grok realtime model |
| voice | str | "Ara" | Voice ("Ara", "Rex", "Sal", "Eve", "Leo") |
| api_key | str | None | API key (defaults to XAI_API_KEY env var) |
| turn_detection | str or None | "server_vad" | Turn detection mode ("server_vad" or None for manual) |
| vad_interrupt_response | bool | False | Allow VAD to auto-cancel the assistant response on detected speech |
| web_search | bool | True | Enable web search tool |
| x_search | bool | True | Enable X (Twitter) search tool |
| x_search_allowed_handles | list[str] | None | Restrict X search to specific handles |
vad_interrupt_response defaults to False because speaker-to-mic echo can cause the server to cancel the agent’s own response mid-sentence. Set to True only if your audio setup avoids echo feedback.
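As a sketch, the parameters above can be combined into a more explicit configuration. The values here are illustrative assumptions, not recommended settings:

```python
from vision_agents.plugins import xai

llm = xai.Realtime(
    model="grok-4-1-fast-non-reasoning",  # default realtime model
    voice="Rex",                          # one of "Ara", "Rex", "Sal", "Eve", "Leo"
    turn_detection=None,                  # manual turn handling instead of "server_vad"
    vad_interrupt_response=False,         # keep False unless your audio setup avoids echo
    web_search=True,
    x_search=True,
    x_search_allowed_handles=["xai"],     # illustrative handle list
)
```

With turn_detection=None the server no longer detects turn boundaries for you, so your application is responsible for signaling when the user has finished speaking.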

Function calling

@agent.llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> str:
    return f"The weather in {location} is sunny and 72°F"
See the Function Calling guide for details.
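Registered functions are ordinary async Python callables, so they can be exercised directly in tests before wiring them into an agent. A minimal sketch, using the weather example from above outside the decorator:

```python
import asyncio

# The same tool function as above; inside an agent it would be registered with
# @agent.llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> str:
    # A real implementation would call a weather API here.
    return f"The weather in {location} is sunny and 72°F"

# Plain async Python, so it can be run and asserted on directly:
print(asyncio.run(get_weather("Austin")))
```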

Next steps

xAI LLM

Advanced reasoning with Grok

xAI TTS

Text-to-speech with expressive voices