Fish Audio provides high-quality text-to-speech with fine-grained prosody control, voice cloning support, and multiple backend models. Ideal for multilingual applications.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Fish Audio also provides speech-to-text with automatic language detection. You can use both in the same agent.
The S2-Pro model (default) supports inline control tags for natural prosody:
tts = fish.TTS()  # Uses s2-pro by default

# Include prosody tags in your text
text = "[whisper] This is a secret. [super happy] But this is great news!"
text = "Hello! [laugh] That's so funny."
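The bracketed tag syntax above can also be assembled programmatically. The following is a minimal sketch, not part of the Fish Audio or Vision Agents API: `with_tag`, `extract_tags`, and `strip_tags` are hypothetical helpers that only manipulate strings, assuming tags are always written as `[name]` before the text they affect.

```python
import re

# Hypothetical helpers for illustration only; they are not provided by the
# Fish Audio plugin. They assume control tags are written as "[name]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def with_tag(tag: str, text: str) -> str:
    """Prefix a piece of text with an inline control tag."""
    return f"[{tag.strip()}] {text.strip()}"


def extract_tags(text: str) -> list:
    """Return the tag names found in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove all tags and collapse whitespace, e.g. for captions or logs."""
    return " ".join(TAG_PATTERN.sub("", text).split())


# Rebuild the first example string from tagged segments.
line = " ".join([
    with_tag("whisper", "This is a secret."),
    with_tag("super happy", "But this is great news!"),
])
```

Keeping a tag-free copy via `strip_tags` may be useful when the same text is also shown to users as a transcript, since the tags are meant for the speech model rather than for display.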
# Use the latest S2-Pro model with prosody control
tts = fish.TTS(model="s2-pro")

# Use legacy models if needed
tts = fish.TTS(model="speech-1.5")
tts = fish.TTS(model="speech-1.6")

# Use fast models for lower latency
tts = fish.TTS(model="s1")
tts = fish.TTS(model="s1-mini")
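When the model name comes from configuration, it can help to check it before constructing the TTS instance. The sketch below is an assumption-laden illustration: `resolve_model` and the `MODELS` grouping are hypothetical and not part of the plugin, the model names are only the ones listed above, and the plugin itself may accept others.

```python
from typing import Optional

# Model names taken from the examples above, grouped by the comments there.
# This table is illustrative; the plugin may support additional models.
MODELS = {
    "prosody": ["s2-pro"],
    "legacy": ["speech-1.5", "speech-1.6"],
    "fast": ["s1", "s1-mini"],
}
DEFAULT_MODEL = "s2-pro"
KNOWN_MODELS = {name for group in MODELS.values() for name in group}


def resolve_model(name: Optional[str] = None) -> str:
    """Return a known model name, falling back to the default when unset."""
    if name is None or not name.strip():
        return DEFAULT_MODEL
    candidate = name.strip().lower()
    if candidate not in KNOWN_MODELS:
        raise ValueError(
            f"Unknown model {name!r}; expected one of {sorted(KNOWN_MODELS)}"
        )
    return candidate


# Intended use (requires the fish plugin to be imported):
# tts = fish.TTS(model=resolve_model("s1-mini"))
```

Failing early on a typo such as `"s2pro"` keeps the error close to the configuration that caused it.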