# Speech-to-Text and Text-to-Speech Class

LLMs that do not run as a `Realtime` model need help converting the user's speech into text and the LLM's responses into audio the user can hear. To support this, the `Agent` class exposes two parameters, `tts` and `stt`, which let developers pass in any text-to-speech and speech-to-text service they like. Through these services, the output voice can be configured, the transcription rate can be adjusted, and more.

Internally, the `Agent` class coordinates these services and handles details such as setting up the audio track for the STT provider.

### STT (Speech-to-Text)

STT components convert audio input into text for processing by the LLM. All implementations follow a standardised interface with consistent event emission.

These components process real-time audio with `PcmData` objects from `getstream.video.rtc.track_util`, provide partial transcript support for responsive UI, and include comprehensive error handling and connection management. Multiple providers are supported including Deepgram, ElevenLabs, Fast Whisper, and others.

`await stt.start()` must be called on every STT provider before it processes audio, so that connections and resources are initialized.

Some STT providers include built-in turn detection (indicated by the `turn_detection` property). When this is the case, the Agent automatically skips any separately configured `TurnDetector` to avoid conflicts.
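The skip rule described above can be pictured with a minimal sketch. The classes below are hypothetical stand-ins rather than Vision Agents classes, and `select_turn_detector` is an illustrative helper, not a library function. Only the `turn_detection` property name comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubSTT:
    """Hypothetical stand-in for an STT provider (not a library class)."""
    name: str
    turn_detection: bool = False  # True when the provider detects turns itself


@dataclass
class StubTurnDetector:
    """Hypothetical stand-in for a separately configured TurnDetector."""
    name: str = "external-detector"


def select_turn_detector(
    stt: StubSTT, configured: Optional[StubTurnDetector]
) -> Optional[StubTurnDetector]:
    """Return the detector to run, or None when the STT already handles turns."""
    if stt.turn_detection:
        return None  # skip the external detector to avoid conflicting signals
    return configured


detector = StubTurnDetector()
with_builtin = select_turn_detector(StubSTT("a", turn_detection=True), detector)
without_builtin = select_turn_detector(StubSTT("b"), detector)
```

In this sketch, `with_builtin` is `None` because the provider reports built-in turn detection, while `without_builtin` is the configured detector.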

#### STT Methods

| Method                                 | Description                                                     |
| -------------------------------------- | --------------------------------------------------------------- |
| `start()`                              | Initialize connections and resources. Must be called before use |
| `process_audio(pcm_data, participant)` | Process an audio frame (~20 ms chunks)                          |
| `clear()`                              | Clear any pending audio or internal state                       |
| `close()`                              | Clean up resources                                              |
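As a rough illustration of the call order in the table, the sketch below drives a stand-in object through `start()`, `process_audio()`, `clear()`, and `close()`, feeding it roughly 20 ms frames. `StubSTT`, `samples_per_frame`, and `run_session` are hypothetical names invented for this example. The stub takes plain lists of samples instead of `PcmData`, and treating `clear()` and `close()` as coroutines is an assumption.

```python
import asyncio
from typing import List, Optional


def samples_per_frame(sample_rate: int, frame_ms: int = 20) -> int:
    """Number of samples per channel in one frame of `frame_ms` milliseconds."""
    return sample_rate * frame_ms // 1000


class StubSTT:
    """Hypothetical stand-in that mirrors the four documented method names."""

    def __init__(self) -> None:
        self.started = False
        self.pending: List[List[int]] = []

    async def start(self) -> None:
        self.started = True  # a real provider would open its connection here

    async def process_audio(
        self, pcm_data: List[int], participant: Optional[str] = None
    ) -> None:
        if not self.started:
            # Assumed behaviour for the stub; a real provider may fail differently.
            raise RuntimeError("call start() before process_audio()")
        self.pending.append(pcm_data)

    async def clear(self) -> None:
        self.pending.clear()  # drop buffered audio and internal state

    async def close(self) -> None:
        self.started = False  # a real provider would release resources here


async def run_session(sample_rate: int, total_samples: int) -> int:
    """Feed silence to the stub in ~20 ms frames and return the frame count."""
    stt = StubSTT()
    await stt.start()
    size = samples_per_frame(sample_rate)
    audio = [0] * total_samples
    frames = 0
    try:
        for offset in range(0, len(audio), size):
            await stt.process_audio(audio[offset:offset + size], participant="user-1")
            frames += 1
    finally:
        await stt.clear()
        await stt.close()
    return frames


frames_sent = asyncio.run(run_session(sample_rate=16_000, total_samples=16_000))
```

At 16 kHz a 20 ms frame holds 320 samples per channel, so one second of audio is delivered as 50 calls to `process_audio()`.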

#### STT Events

| Event                  | Description                                       |
| ---------------------- | ------------------------------------------------- |
| `STTConnectedEvent`    | STT connection established                        |
| `STTDisconnectedEvent` | STT connection closed (with `reason` and `clean`) |
| `STTErrorEvent`        | Temporary, recoverable error                      |

Final user transcripts surface on the agent as `UserTranscriptEvent` (from `vision_agents.core.agents.events`) — not as a separate STT event — so the same handler works in both classic STT and realtime modes. See [Events Reference](/reference/events-reference).

### TTS (Text-to-Speech)

TTS components convert LLM responses into audio output. They handle audio synthesis and streaming to the output track.

These components provide streaming audio synthesis for low latency, multiple voice options and customisation, audio format standardisation using `PcmData` and `AudioFormat` from `getstream.video.rtc.track_util`, and support for providers like ElevenLabs, Cartesia, and others.

#### TTS Methods

| Method                                                   | Description                                                                               |
| -------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `set_output_format(sample_rate, channels, audio_format)` | Configure output audio format. Audio is automatically resampled and re-channeled to match |
| `send_iter(text, participant=None)`                      | Convert text to speech and yield `TTSOutputChunk` items                                   |
| `stop_audio()`                                           | Clear the audio queue and stop current playback                                           |
| `interrupt()`                                            | Increment the interruption epoch and cancel stale in-flight synthesis                     |
| `close()`                                                | Clean up resources                                                                        |
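The streaming shape of `send_iter` can be sketched with a stand-in. `StubTTS` and `StubChunk` below are hypothetical and do not reflect the real fields of `TTSOutputChunk`. The sketch also assumes that `send_iter` is an asynchronous generator and that `stop_audio()` is a coroutine, neither of which is confirmed by the table above.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List, Optional, Tuple


@dataclass
class StubChunk:
    """Hypothetical stand-in for TTSOutputChunk; the real fields may differ."""
    data: bytes


class StubTTS:
    """Hypothetical stand-in that mirrors `send_iter` and `stop_audio` by name."""

    def __init__(self) -> None:
        self.queue: List[bytes] = []

    async def send_iter(
        self, text: str, participant: Optional[str] = None
    ) -> AsyncIterator[StubChunk]:
        # Pretend each word is one synthesized audio chunk.
        for word in text.split():
            await asyncio.sleep(0)  # yield control, as a network call would
            yield StubChunk(data=word.encode("utf-8"))

    async def stop_audio(self) -> None:
        self.queue.clear()  # drop everything waiting to be played


async def demo() -> Tuple[List[bytes], int]:
    tts = StubTTS()
    # Chunks can be queued as soon as they arrive, before synthesis finishes.
    async for chunk in tts.send_iter("hello there friend"):
        tts.queue.append(chunk.data)
    queued = list(tts.queue)
    await tts.stop_audio()
    return queued, len(tts.queue)


queued, remaining = asyncio.run(demo())
```

Consuming chunks as they are yielded is what keeps latency low: playback can begin on the first chunk rather than after the full utterance is synthesized.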

#### TTS Events

| Event                       | Description                                                                                       |
| --------------------------- | ------------------------------------------------------------------------------------------------- |
| `TTSSynthesisStartEvent`    | Synthesis has begun for a text input                                                              |
| `TTSSynthesisCompleteEvent` | Synthesis finished (includes metrics like `synthesis_time_ms`, `chunk_count`, `real_time_factor`) |
| `TTSConnectedEvent`         | TTS connection established                                                                        |
| `TTSDisconnectedEvent`      | TTS connection closed (with `reason` and `clean`)                                                 |
| `TTSErrorEvent`             | Temporary, recoverable error                                                                      |

#### Interruption support

The TTS base class exposes an `epoch` property and an `interrupt()` method for handling barge-in scenarios:

| Member        | Type    | Description                                                                                         |
| ------------- | ------- | --------------------------------------------------------------------------------------------------- |
| `epoch`       | `int`   | Monotonic counter that increments on each interruption. Used to identify stale audio events.        |
| `interrupt()` | `async` | Increments the epoch and stops the current audio synthesis. Stale events are automatically dropped. |

You usually do not need to call `interrupt()` manually — `agent.simple_response(..., interrupt=True)` and `agent.say(..., interrupt=True)` route interruption through the active inference flow.
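The epoch idea can be illustrated with a small, self-contained sketch. `EpochGate` and `AudioChunk` are hypothetical names for this example and are not the library's TTS base class. For brevity, `interrupt()` is synchronous here, whereas the table above lists it as `async`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioChunk:
    """Hypothetical chunk tagged with the epoch in which it was synthesized."""
    epoch: int
    payload: bytes


class EpochGate:
    """Illustrative epoch counter; not the library's implementation."""

    def __init__(self) -> None:
        self.epoch = 0

    def interrupt(self) -> None:
        self.epoch += 1  # everything tagged with an older epoch is now stale

    def tag(self, payload: bytes) -> AudioChunk:
        return AudioChunk(epoch=self.epoch, payload=payload)

    def is_stale(self, chunk: AudioChunk) -> bool:
        return chunk.epoch < self.epoch


gate = EpochGate()
in_flight = [gate.tag(b"hello "), gate.tag(b"world")]  # synthesized in epoch 0
gate.interrupt()  # the user barges in
fresh = [gate.tag(b"sure, go ahead")]  # synthesized in epoch 1

# Only chunks from the current epoch are allowed through to playback.
playable: List[bytes] = [
    chunk.payload for chunk in in_flight + fresh if not gate.is_stale(chunk)
]
```

Because the counter only ever increases, a chunk that finishes synthesis late can be recognized as stale by a single integer comparison, without tracking individual requests.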