How It Works
When several participants join a call, the agent maintains a separate audio queue for each one. A multi-speaker filter decides whose audio actually reaches the pipeline:
- Each participant gets their own audio queue.
- A `FirstSpeakerWinsFilter` (enabled by default) uses Silero VAD to detect speech.
- The first participant whose speech exceeds the VAD threshold acquires a lock, and only their audio passes through.
- Everyone else’s audio is dropped until the lock is released.
- The lock releases when the active speaker’s turn ends or they go silent.
Configuration
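Pass a `multi_speaker_filter` to the `Agent` constructor to customize the behavior, for example `Agent(..., multi_speaker_filter=FirstSpeakerWinsFilter(speech_threshold=0.5, silence_release_ms=1500.0))`, where `...` stands for your other constructor arguments.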
Omitting `multi_speaker_filter` (or passing `None`) defaults to `FirstSpeakerWinsFilter()` with the default parameter values listed below.
FirstSpeakerWinsFilter Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `speech_threshold` | float | 0.5 | Silero VAD score (0.0–1.0) a participant must exceed to acquire the lock |
| `silence_release_ms` | float | 1500.0 | Milliseconds of silence from the active speaker before releasing the lock |

The lock behaves as follows:
- No lock held: all audio passes through. The first participant whose VAD score exceeds `speech_threshold` acquires the lock.
- Lock held: only the locked speaker’s audio reaches the pipeline. Other participants’ audio is dropped without running VAD (no extra cost).
- Silence timeout: if the active speaker goes silent for `silence_release_ms`, the lock is released.
- Turn end: a `TurnEndedEvent` releases the lock unconditionally.
- Participant disconnects: the lock is cleared immediately if that participant held it.
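
The rules above can be modeled as a small state machine. The sketch below is a toy model, not the library's implementation: the class `SpeakerLock`, its method names, and the idea of passing in a precomputed VAD score and timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeakerLock:
    """Toy model of the first-speaker-wins lock (hypothetical, for illustration)."""

    speech_threshold: float = 0.5
    silence_release_ms: float = 1500.0
    holder: Optional[str] = None
    last_speech_ms: float = 0.0

    def process(self, participant: str, vad_score: float, now_ms: float) -> bool:
        """Return True if this participant's audio chunk should reach the pipeline."""
        # Silence timeout: release the lock once the holder has been quiet long enough.
        if self.holder is not None and now_ms - self.last_speech_ms >= self.silence_release_ms:
            self.holder = None

        if self.holder is None:
            # No lock held: audio passes; the first speaker over the threshold takes the lock.
            if vad_score > self.speech_threshold:
                self.holder = participant
                self.last_speech_ms = now_ms
            return True

        if participant != self.holder:
            # Lock held by someone else: drop the chunk. (The docs say the real
            # filter skips VAD here; this model takes the score as an argument.)
            return False

        # Lock holder: refresh the silence timer while they keep speaking.
        if vad_score > self.speech_threshold:
            self.last_speech_ms = now_ms
        return True

    def on_turn_ended(self) -> None:
        """Turn end releases the lock unconditionally."""
        self.holder = None

    def on_participant_left(self, participant: str) -> None:
        """A disconnect clears the lock only if that participant held it."""
        if self.holder == participant:
            self.holder = None


lock = SpeakerLock()
print(lock.process("alice", 0.9, now_ms=0))     # True: alice acquires the lock
print(lock.process("bob", 0.9, now_ms=100))     # False: bob is dropped
print(lock.process("alice", 0.1, now_ms=1600))  # True: silence timeout released the lock
print(lock.process("bob", 0.9, now_ms=1700))    # True: bob now holds the lock
```

In this model the timeout is only checked when a chunk arrives, which is enough to show the ordering of the rules; a real filter may track time differently.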
Building a Custom AudioFilter
Replace the default filter with your own by implementing the `AudioFilter` interface and passing an instance as `multi_speaker_filter`.
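
As a rough illustration, here is a minimal sketch of a custom filter. It assumes the interface boils down to a per-chunk forward-or-drop decision plus a hook for departing participants. The real `AudioFilter` signature may differ, so `AudioFilterSketch` and its method names are hypothetical stand-ins, not the SDK's API.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional


class AudioFilterSketch(ABC):
    """Hypothetical stand-in for the SDK's AudioFilter interface (assumed shape)."""

    @abstractmethod
    def process(self, participant_id: str, chunk: bytes) -> Optional[bytes]:
        """Return the chunk to forward it to the pipeline, or None to drop it."""

    @abstractmethod
    def on_participant_left(self, participant_id: str) -> None:
        """Release any state held for a participant who disconnected."""


class AllowListFilter(AudioFilterSketch):
    """Forward audio only from an explicit set of participants, e.g. the host."""

    def __init__(self, allowed: Iterable[str]) -> None:
        self._allowed = set(allowed)

    def process(self, participant_id: str, chunk: bytes) -> Optional[bytes]:
        # Forward chunks from allowed participants; drop everything else.
        return chunk if participant_id in self._allowed else None

    def on_participant_left(self, participant_id: str) -> None:
        # Forget the participant so a reused ID is not trusted by accident.
        self._allowed.discard(participant_id)


host_only = AllowListFilter(["host"])
print(host_only.process("host", b"\x00\x01") is not None)  # True: forwarded
print(host_only.process("guest", b"\x00\x01") is None)     # True: dropped
```

An allow-list needs no VAD at all, which makes it a useful contrast to the default filter: the decision depends on who is speaking rather than on who spoke first.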
Best Practices
Tune thresholds for your environment: lower `speech_threshold` for quiet speakers; raise it to reject background noise. Adjust `silence_release_ms` based on expected pause lengths in your use case.
Combine with turn detection: the multi-speaker filter gates which speaker’s audio reaches the pipeline, while turn detection determines when the speaker has finished. They work together automatically.
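
For the `silence_release_ms` tuning advice, one possible starting point is to measure typical mid-turn pauses in your own recordings and pick a value just above most of them. The helper below is a hypothetical heuristic (nearest-rank percentile plus a safety margin), not something the SDK provides. The function name, the 95th percentile, and the 200 ms margin are illustrative assumptions.

```python
import math
from typing import Sequence


def suggest_silence_release_ms(
    pauses_ms: Sequence[float],
    percentile: float = 95.0,
    margin_ms: float = 200.0,
) -> float:
    """Suggest a timeout longer than most observed mid-turn pauses."""
    if not pauses_ms:
        raise ValueError("need at least one observed pause")
    ordered = sorted(pauses_ms)
    # Nearest-rank percentile: the smallest value covering `percentile` % of samples.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1] + margin_ms


# Pause lengths (ms) measured between phrases within a single turn.
print(suggest_silence_release_ms([300, 450, 600, 800, 1200]))  # 1400.0
```

Treat the result as a first guess to validate in real calls: too low a value releases the lock mid-sentence, while too high a value keeps other participants muted longer than necessary.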

