Fish Audio provides speech-to-text with automatic language detection. Audio is buffered per participant (minimum 1 second) before it is sent to the API, which makes transcription more accurate.
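The per-participant buffering described above can be sketched roughly as follows. This is an illustrative sketch only, not the integration's actual implementation: the class name `ParticipantAudioBuffer`, the `add` method, and the audio format (16 kHz, 16-bit mono PCM) are assumptions made for the example. Only the 1-second minimum comes from the description above.

```python
from collections import defaultdict
from typing import Optional

SAMPLE_RATE_HZ = 16_000    # assumed sample rate for this sketch
BYTES_PER_SAMPLE = 2       # assumed 16-bit mono PCM
MIN_BUFFER_SECONDS = 1.0   # minimum buffered duration before sending


class ParticipantAudioBuffer:
    """Hypothetical helper: accumulates PCM audio separately per participant."""

    def __init__(self, min_seconds: float = MIN_BUFFER_SECONDS) -> None:
        # Convert the minimum duration into a byte threshold.
        self._min_bytes = int(min_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
        self._buffers = defaultdict(bytearray)

    def add(self, participant_id: str, pcm_chunk: bytes) -> Optional[bytes]:
        """Append a chunk; return the buffered audio once the minimum is reached."""
        buf = self._buffers[participant_id]
        buf.extend(pcm_chunk)
        if len(buf) < self._min_bytes:
            return None  # not enough audio yet, keep buffering
        ready = bytes(buf)
        buf.clear()  # start a fresh buffer for this participant
        return ready


buffers = ParticipantAudioBuffer()
half_second = b"\x00" * (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE // 2)

first = buffers.add("alice", half_second)   # below the threshold
second = buffers.add("alice", half_second)  # reaches 1 second of audio
other = buffers.add("bob", half_second)     # a separate participant's buffer
```

Here `first` and `other` are `None` because each participant has only half a second buffered at that point, while `second` holds the full second of audio that would then be sent for transcription. Keeping one buffer per participant avoids mixing speakers within a single transcription request.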
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Fish Audio also provides high-quality text-to-speech with prosody control and voice cloning. You can use both in the same agent.