Fish Audio is a high-quality AI voice platform that provides both Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities. It offers fast, accurate transcription with automatic language detection and natural-sounding voice synthesis with support for voice cloning. The Fish Audio plugin for Vision Agents enables real-time transcription and speech synthesis, making it ideal for voice agents, multilingual applications, and conversational AI systems.

Installation

Install the Fish Audio plugin with uv (the quotes keep shells such as zsh from interpreting the square brackets):
uv add "vision-agents[fish]"

Example

Check out our Fish Audio example to see a practical implementation of the plugin, or read on for some key details.

Text-to-Speech (TTS)

Initialisation

The Fish Audio TTS plugin is exposed via the TTS class:
from vision_agents.plugins import fish

# Initialize with default settings
tts = fish.TTS()

# Or with custom options
tts = fish.TTS(
    api_key="your-api-key",
    reference_id="your_reference_voice_id"
)
To initialise without passing in the API key, make sure the FISH_API_KEY environment variable is set. You can do this either by defining it in a .env file or exporting it directly in your terminal.
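
The fallback described above (explicit argument first, then the FISH_API_KEY environment variable) can be sketched as follows. The helper name resolve_fish_api_key is hypothetical and not part of the plugin, which performs its own lookup; the sketch is only useful if you want to fail early with a clear message before constructing the TTS or STT objects.

```python
import os
from typing import Optional


def resolve_fish_api_key(explicit_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else read FISH_API_KEY."""
    key = explicit_key or os.environ.get("FISH_API_KEY")
    if not key:
        # Fail early instead of waiting for the first API call to be rejected
        raise RuntimeError(
            "No Fish Audio API key found: pass api_key=... or set FISH_API_KEY."
        )
    return key
```

The returned value could then be passed as api_key to fish.TTS or fish.STT.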

Parameters

These are the parameters available in the Fish TTS plugin:
| Name | Type | Default | Description |
|---|---|---|---|
| api_key | str or None | None | Your Fish Audio API key. If not provided, uses the FISH_API_KEY environment variable. |
| reference_id | str or None | None | Optional reference voice ID for voice cloning. Uses a default voice if not specified. |
| base_url | str or None | None | Optional custom API endpoint. |
| client | Session or None | None | Optionally pass your own Fish Audio Session instance. |

Functionality

Send text to convert to speech

The send() method sends text to Fish Audio for synthesis. The resulting audio is played through the configured output track:
await tts.send("Hello, this is a test of Fish Audio text-to-speech.")

Voice Cloning

Fish Audio supports voice cloning using reference audio:
# Using a reference voice ID
tts = fish.TTS(reference_id="your_reference_voice_id")

# The reference voice will be used for all subsequent synthesis
await tts.send("This will use the reference voice.")

Speech-to-Text (STT)

Initialisation

The Fish Audio STT plugin is exposed via the STT class:
from vision_agents.plugins import fish

# Initialize with default settings
stt = fish.STT()

# Or with custom options
stt = fish.STT(
    api_key="your-api-key",
    language="en"
)
To initialise without passing in the API key, make sure the FISH_API_KEY environment variable is set.

Parameters

These are the parameters available in the Fish STT plugin:
| Name | Type | Default | Description |
|---|---|---|---|
| api_key | str or None | None | Your Fish Audio API key. If not provided, uses the FISH_API_KEY environment variable. |
| language | str or None | None | Language code for transcription (e.g., "en", "zh"). If None, automatic language detection is used. |
| client | Session or None | None | Optionally pass your own Fish Audio Session instance. |

Functionality

Process Audio

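Once you join the call, you can listen for audio events and pass each PCM chunk to the STT instance for processing:
from getstream.video import rtc
from getstream.video.rtc.track_util import PcmData  # import path may differ between getstream versions

# `call`, `bot_user_id`, and `stt` are assumed to have been created earlier
async with rtc.join(call, bot_user_id) as connection:

    @connection.on("audio")
    async def on_audio(pcm: PcmData, user):
        # Forward each incoming PCM chunk to Fish Audio STT
        await stt.process_audio(pcm, user)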

Events

Transcript Event
The transcript event is triggered when a final transcript is available from Fish Audio:
from vision_agents.core.stt.events import STTTranscriptEvent

@stt.events.subscribe
async def on_transcript(event: STTTranscriptEvent):
    print(f"Final transcript: {event.text}")
    print(f"User: {event.participant.user_id}")
    print(f"Language: {event.response.language}")

Error Event
If an error occurs during transcription, an error event is fired:
from vision_agents.core.stt.events import STTErrorEvent

@stt.events.subscribe
async def on_stt_error(event: STTErrorEvent):
    print(f"STT error: {event.error}")
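
Since only final transcripts are delivered, a common next step is to group them per speaker. The sketch below is one possible way to do that; TranscriptLog is a hypothetical class, not part of Vision Agents, and it relies only on the event.text and event.participant.user_id fields shown above. The events here are stand-in objects so the example runs without a live call.

```python
from collections import defaultdict
from types import SimpleNamespace


class TranscriptLog:
    """Hypothetical helper that accumulates final transcripts per user."""

    def __init__(self) -> None:
        self._by_user = defaultdict(list)

    def add(self, event) -> None:
        # Skip empty transcripts so they do not add stray spaces
        text = event.text.strip()
        if text:
            self._by_user[event.participant.user_id].append(text)

    def text_for(self, user_id: str) -> str:
        # .get avoids creating an entry for users we have never seen
        return " ".join(self._by_user.get(user_id, []))


# Stand-in events that mimic the two fields used above
log = TranscriptLog()
for text in ("Hello there.", "  ", "How are you?"):
    log.add(SimpleNamespace(text=text, participant=SimpleNamespace(user_id="alice")))

print(log.text_for("alice"))
```

In a real agent, log.add(event) would be called from inside the on_transcript handler.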

Supported Languages

Fish Audio STT supports multiple languages with automatic detection:
  • en - English
  • zh - Chinese
  • es - Spanish
  • fr - French
  • de - German
  • ja - Japanese
  • ko - Korean
  • pt - Portuguese
For automatic language detection, set language=None (default).
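
If the language code comes from user input or configuration, it may help to validate it against the list above before constructing the STT object. The following is a minimal sketch under stated assumptions: normalize_language is a hypothetical helper, and reducing a region-tagged code such as "pt-BR" to its two-letter base is an assumption based on the two-letter examples in this page, not documented plugin behaviour.

```python
from typing import Optional

# Codes taken from the list of supported languages above
SUPPORTED_LANGUAGES = {
    "en": "English", "zh": "Chinese", "es": "Spanish", "fr": "French",
    "de": "German", "ja": "Japanese", "ko": "Korean", "pt": "Portuguese",
}


def normalize_language(code: Optional[str]) -> Optional[str]:
    """Return a supported two-letter code, or None to request auto-detection."""
    if code is None or not code.strip():
        return None  # language=None lets Fish Audio detect the language
    # Assumption: keep only the base language of tags like "pt-BR" or "zh_CN"
    base = code.strip().lower().replace("_", "-").split("-")[0]
    if base not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return base
```

The result can be passed straight through as the language argument, for example fish.STT(language=normalize_language(user_choice)), where user_choice is whatever value your application collected.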

Audio Format Requirements

The STT implementation accepts PCM audio data with the following specifications:
  • Sample rate: 16 kHz or higher recommended
  • Format: Mono, 16-bit PCM
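
If your audio source produces interleaved stereo, it needs to be downmixed to mono before it matches the format above. The function below is a minimal standard-library sketch of one way to do that by averaging the two channels; stereo_to_mono_int16 is a hypothetical helper, and it assumes native-endian signed 16-bit samples laid out as L, R, L, R. It does not resample, so the sample rate is unchanged.

```python
from array import array


def stereo_to_mono_int16(interleaved: bytes) -> bytes:
    """Downmix interleaved 16-bit stereo PCM to mono by averaging channels."""
    samples = array("h")  # "h" = signed 16-bit integers, native byte order
    samples.frombytes(interleaved)
    if len(samples) % 2:
        raise ValueError("Stereo input must contain an even number of samples")
    # Average each left/right pair; the result always fits in 16 bits
    mono = array(
        "h",
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)),
    )
    return mono.tobytes()
```

For anything beyond simple downmixing (resampling, float input, other bit depths), a dedicated audio library is likely a better fit than this sketch.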

Getting Your API Key

  1. Sign up for a Fish Audio account at https://fish.audio
  2. Navigate to the API Keys section in your dashboard
  3. Create a new API key
  4. Set the FISH_API_KEY environment variable or pass it directly to the plugin