Fish Audio STT - Vision Agents

Fish Audio provides speech-to-text with automatic language detection. The plugin buffers audio per participant (minimum 1 second) before sending it to the API, which gives more accurate transcription.
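To make the buffering behavior concrete, here is a minimal sketch of per-participant buffering with a 1-second minimum. It is not the plugin's actual implementation: the `ParticipantBuffers` class, the 16 kHz sample rate, and the 16-bit mono PCM format are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Optional

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio
BYTES_PER_SAMPLE = 2   # assumption: 16-bit PCM
MIN_SECONDS = 1.0      # minimum buffer length described above


class ParticipantBuffers:
    """Hypothetical helper that accumulates PCM bytes per participant."""

    def __init__(self, min_seconds: float = MIN_SECONDS) -> None:
        # Number of bytes that corresponds to the minimum duration.
        self._min_bytes = int(min_seconds * SAMPLE_RATE * BYTES_PER_SAMPLE)
        # One independent buffer per participant ID.
        self._buffers = defaultdict(bytearray)

    def add(self, participant_id: str, pcm: bytes) -> Optional[bytes]:
        """Append audio; return the buffered chunk once it is long enough."""
        buf = self._buffers[participant_id]
        buf.extend(pcm)
        if len(buf) < self._min_bytes:
            return None  # not enough audio yet, keep buffering
        chunk = bytes(buf)
        buf.clear()
        return chunk  # ready to send to the transcription API
```

Each participant's audio accumulates separately, so one speaker's short utterance never flushes another speaker's buffer early.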

Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Fish Audio also provides high-quality text-to-speech with prosody control and voice cloning. You can use both in the same agent.

Installation

uv add "vision-agents[fish]"

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import fish, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-3-flash-preview"),
    stt=fish.STT(),
    tts=fish.TTS(),
)

Set FISH_API_KEY in your environment or pass api_key directly.
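If you prefer to fail fast when no key is configured, a small helper can resolve the key before the agent is built. This is a sketch only: `resolve_fish_api_key` is a hypothetical function, not part of the plugin, and it mirrors the documented precedence (explicit `api_key` first, then `FISH_API_KEY`) without claiming to match the plugin's internal logic.

```python
import os
from typing import Optional


def resolve_fish_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else FISH_API_KEY."""
    key = api_key or os.environ.get("FISH_API_KEY")
    if not key:
        # Raise early instead of failing on the first transcription request.
        raise RuntimeError("Set FISH_API_KEY or pass api_key explicitly")
    return key
```

The result can then be passed straight through, for example `fish.STT(api_key=resolve_fish_api_key())`.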

Parameters

stt = fish.STT(language="en")  # Or None for auto-detection

Name	Type	Default	Description
`language`	`str`	`None`	Language code (`"en"`, `"zh"`, etc.) or `None` for auto-detect
`api_key`	`str`	`None`	API key (defaults to `FISH_API_KEY` env var)

Next Steps

Fish Audio TTS

Text-to-speech with prosody control

Build a Voice Agent

Get started with voice
