Unlike a Realtime model, a standard LLM requires some help to convert the user's speech into text it can process, and its responses into audio the user can hear. To achieve this, the Agent class exposes two parameters, tts and stt, which let developers pass in any text-to-speech and speech-to-text service they like. Through these, the output voices can be configured, the transcription rate can be adjusted, and more.
Internally, the Agent class manages the coordination between these services, handling details such as setting up the audio track for the STT provider.
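As a minimal sketch of how this wiring might look (the import paths, provider class names, and constructor arguments below are illustrative assumptions rather than the exact API):

```python
# Hypothetical module paths and provider classes, shown for illustration only.
from agents import Agent
from agents.plugins.deepgram import DeepgramSTT
from agents.plugins.elevenlabs import ElevenLabsTTS

agent = Agent(
    stt=DeepgramSTT(),                        # speech-to-text provider
    tts=ElevenLabsTTS(voice_id="my-voice"),   # text-to-speech provider with a configured voice
)
```

Because both parameters accept any conforming provider, swapping Deepgram for Moonshine, or ElevenLabs for Cartesia, is a one-line change.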
STT (Speech-to-Text)
STT components convert audio input into text for processing by the LLM. All implementations follow a standardised interface with consistent event emission: they process real-time PCM audio, support partial transcripts for responsive UIs, and include comprehensive error handling and connection management. Multiple providers are supported, including Deepgram, Moonshine, and others. Events include transcript for complete results, partial_transcript for real-time display, and error for failure handling and recovery.
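Subscribing to these events might look like the following sketch (the on(...) registration method and the handler payloads are assumptions; only the event names come from the interface described above):

```python
# The .on(...) decorator and payload types are assumed for illustration;
# the event names match those listed above.
stt = DeepgramSTT()

@stt.on("partial_transcript")
def on_partial(text: str) -> None:
    # Interim results: update a live caption in the UI.
    print(f"[partial] {text}")

@stt.on("transcript")
def on_final(text: str) -> None:
    # Complete result: this is what gets forwarded to the LLM.
    print(f"[final] {text}")

@stt.on("error")
def on_error(err: Exception) -> None:
    # Connection or provider failure; a real handler might reconnect.
    print(f"STT error: {err}")
```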
TTS (Text-to-Speech)
TTS components convert LLM responses into audio output, handling synthesis and streaming to the output track. They provide streaming synthesis for low latency, multiple voice options and customisation, audio format standardisation, and support for providers like ElevenLabs, Cartesia, and others. Events cover audio for ready audio chunks, synthesis_start when processing begins, and synthesis_complete when synthesis finishes.
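A corresponding sketch for the TTS side (again, the registration method and payload shapes are assumptions; the event names are from the list above):

```python
# Hypothetical handler registration, mirroring the STT example.
tts = ElevenLabsTTS(voice_id="my-voice")

@tts.on("synthesis_start")
def on_start() -> None:
    # Synthesis has begun; useful for showing a "speaking" indicator.
    print("TTS synthesis started")

@tts.on("audio")
def on_audio(chunk: bytes) -> None:
    # A chunk of synthesised audio is ready; the Agent normally streams
    # this to the output track for you.
    print(f"received {len(chunk)} bytes of audio")

@tts.on("synthesis_complete")
def on_complete() -> None:
    print("TTS synthesis finished")
```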