Mistral Voxtral provides real-time speech-to-text via WebSocket streaming with automatic language detection and low-latency transcription.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Installation
uv add "vision-agents[mistral]"
Quick start
from vision_agents.core import Agent, User
from vision_agents.plugins import mistral, gemini, deepgram, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-2.5-flash"),
    stt=mistral.STT(),
    tts=deepgram.TTS(),
)
Set MISTRAL_API_KEY in your environment or pass api_key directly.
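The lookup order described above (an explicit api_key first, then the MISTRAL_API_KEY environment variable) can be sketched as a small pre-flight check. The helper name `resolve_mistral_api_key` and its `env` parameter are hypothetical and not part of the plugin; the plugin's own internal handling may differ.

```python
import os


def resolve_mistral_api_key(api_key=None, env=None):
    """Hypothetical helper: return the explicit key, else MISTRAL_API_KEY.

    `env` defaults to os.environ and exists only to make the lookup testable.
    """
    env = os.environ if env is None else env
    key = api_key or env.get("MISTRAL_API_KEY")
    if not key:
        # Fail early with a clear message instead of at connection time.
        raise RuntimeError("Set MISTRAL_API_KEY or pass api_key explicitly")
    return key
```

The result could then be passed along as `mistral.STT(api_key=resolve_mistral_api_key())`, which surfaces a missing key before the agent starts.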
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| api_key | str | None | API key (defaults to MISTRAL_API_KEY env var) |
| model | str | "voxtral-mini-transcribe-realtime-2602" | Model identifier |
| sample_rate | int | 16000 | Audio sample rate in Hz (8000, 16000, 22050, 44100, 48000) |
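As a rough illustration of the parameters above, the sketch below validates a sample rate before building keyword arguments for the STT constructor. It assumes the rates listed in the table are the only accepted values; `stt_options` and the two constants are hypothetical names, not part of the plugin.

```python
# Assumption: the rates listed in the parameter table are the only valid ones.
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 44100, 48000)
DEFAULT_MODEL = "voxtral-mini-transcribe-realtime-2602"


def stt_options(sample_rate=16000, model=DEFAULT_MODEL):
    """Hypothetical helper: build keyword arguments using the documented names."""
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(
            f"Unsupported sample_rate {sample_rate}; "
            f"expected one of {SUPPORTED_SAMPLE_RATES}"
        )
    return {"model": model, "sample_rate": sample_rate}
```

With the plugin installed, the result could be unpacked as `mistral.STT(**stt_options(sample_rate=48000))`.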
Turn detection
Mistral Voxtral STT does not include built-in turn detection. Pair it with an external turn detection plugin like Smart Turn or Vogent.
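from vision_agents.core import Agent
from vision_agents.plugins import mistral, smart_turn

agent = Agent(
    stt=mistral.STT(),
    # Turn detection comes from a separate plugin, not from the STT itself
    turn_detection=smart_turn.TurnDetection(),
    # ... other config
)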
Next steps