Fish Audio provides high-quality text-to-speech with fine-grained prosody control, voice cloning support, and multiple backend models. Ideal for multilingual applications.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Fish Audio also provides speech-to-text with automatic language detection. You can use both in the same agent.

Installation

uv add "vision-agents[fish]"

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import fish, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-3-flash-preview"),
    stt=fish.STT(),
    tts=fish.TTS(),  # Uses S2-Pro model by default
)
Set FISH_API_KEY in your environment or pass api_key directly.
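
If the key is missing, it can help to fail at startup with a clear message rather than on the first synthesis request. The following is a minimal sketch of that check; `resolve_fish_api_key` is a hypothetical helper written for illustration and is not part of Vision Agents.

```python
import os
from typing import Optional


def resolve_fish_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, otherwise the FISH_API_KEY env var.

    Hypothetical helper for illustration; not part of Vision Agents.
    """
    key = explicit_key or os.environ.get("FISH_API_KEY")
    if not key:
        # Fail early with an actionable message instead of at request time.
        raise RuntimeError(
            "Fish Audio API key not found: set FISH_API_KEY or pass api_key."
        )
    return key
```

The returned value can then be passed through the documented `api_key` parameter, for example `fish.TTS(api_key=resolve_fish_api_key())`.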

Basic Usage

tts = fish.TTS(reference_id="your_voice_id")  # Optional voice cloning

Prosody Control

The S2-Pro model (default) supports inline control tags for natural prosody:
tts = fish.TTS()  # Uses s2-pro by default

# Include prosody tags in your text
text = "[whisper] This is a secret. [super happy] But this is great news!"
text = "Hello! [laugh] That's so funny."
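
Tagged strings like these can also be assembled programmatically. The sketch below assumes two hypothetical helpers, `tagged` and `join_segments`, that are not part of the plugin. It uses only the tags shown above and does not validate tag names, since the full tag list is not given here.

```python
def tagged(tag: str, sentence: str) -> str:
    """Prefix a sentence with an inline control tag, e.g. "[whisper] ..."."""
    return f"[{tag.strip()}] {sentence.strip()}"


def join_segments(*segments: str) -> str:
    """Join tagged or plain segments into one string, skipping empty ones."""
    return " ".join(s for s in segments if s)


# Rebuilds the first example string from tagged segments.
text = join_segments(
    tagged("whisper", "This is a secret."),
    tagged("super happy", "But this is great news!"),
)
print(text)
```

Keeping each tag next to its sentence makes it easier to change the delivery of one segment without editing the rest of the string.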

Selecting a Model

# Use the latest S2-Pro model with prosody control
tts = fish.TTS(model="s2-pro")

# Use legacy models if needed
tts = fish.TTS(model="speech-1.5")
tts = fish.TTS(model="speech-1.6")

# Use fast models for lower latency
tts = fish.TTS(model="s1")
tts = fish.TTS(model="s1-mini")
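
When the model name comes from configuration, a small check can catch typos before the TTS object is created. This is a sketch under the assumption that the five names listed on this page are the ones you want to allow; `choose_model` is a hypothetical helper, and the plugin itself may accept other values.

```python
from typing import Optional

# Model names copied from this page; treat the list as an assumption.
SUPPORTED_MODELS = ("s2-pro", "speech-1.5", "speech-1.6", "s1", "s1-mini")
DEFAULT_MODEL = "s2-pro"


def choose_model(name: Optional[str] = None) -> str:
    """Return a normalized model name, falling back to the default."""
    if name is None:
        return DEFAULT_MODEL
    normalized = name.strip().lower()
    if normalized not in SUPPORTED_MODELS:
        raise ValueError(
            f"Unknown model {name!r}; expected one of {', '.join(SUPPORTED_MODELS)}"
        )
    return normalized
```

The result can be passed as `fish.TTS(model=choose_model(configured_name))`, where `configured_name` stands for whatever value your own configuration provides.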

Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| model | str | "s2-pro" | Backend model: "s2-pro", "speech-1.5", "speech-1.6", "s1", "s1-mini" |
| reference_id | str | None | Voice ID for voice cloning |
| api_key | str | None | API key (defaults to FISH_API_KEY env var) |
| base_url | str | None | Custom API endpoint |
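
One way to keep these settings in a single place is a small configuration object that mirrors the parameters above. This is a sketch only: `FishTTSConfig` is a hypothetical class, not something the plugin provides, and its defaults are copied from this page.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FishTTSConfig:
    """Hypothetical container mirroring the documented TTS parameters."""

    model: str = "s2-pro"
    reference_id: Optional[str] = None
    api_key: Optional[str] = None
    base_url: Optional[str] = None

    def to_kwargs(self) -> dict:
        """Return only the values that were set, so unset ones keep their defaults."""
        return {k: v for k, v in asdict(self).items() if v is not None}


config = FishTTSConfig(reference_id="your_voice_id")
print(config.to_kwargs())
```

The resulting dictionary can be unpacked into the constructor, as in `fish.TTS(**config.to_kwargs())`.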

Next Steps

Fish Audio STT: Speech-to-text with auto language detection

Build a Voice Agent: Get started with voice