How It Works
When several participants join a call, the agent maintains a separate audio queue for each one. A multi-speaker filter decides whose audio actually reaches the pipeline:
- Each participant gets their own audio queue.
- A FirstSpeakerWinsFilter (enabled by default) uses Silero VAD to detect speech.
- The first participant whose speech exceeds the VAD threshold acquires a lock; only their audio passes through.
- Everyone else’s audio is dropped until the lock is released.
- The lock releases when the active speaker’s turn ends or they go silent.
Configuration
Pass a multi_speaker_filter to the Agent constructor to customize the behavior.
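A minimal sketch of the call shape follows. This page does not show the library's import paths, so the example defines small stand-in classes named after the documented ones instead of importing them. The stand-ins model only the multi_speaker_filter argument and the two documented parameters; the values 0.6 and 1000.0 are illustrative, not recommendations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstSpeakerWinsFilter:
    """Stand-in that mirrors only the two documented parameters."""
    speech_threshold: float = 0.5        # VAD score needed to acquire the lock
    silence_release_ms: float = 1500.0   # silence before the lock is released


class Agent:
    """Stand-in: models only the multi_speaker_filter argument."""

    def __init__(self, multi_speaker_filter: Optional[FirstSpeakerWinsFilter] = None):
        # Omitting the argument (or passing None) falls back to the default filter.
        if multi_speaker_filter is None:
            multi_speaker_filter = FirstSpeakerWinsFilter()
        self.multi_speaker_filter = multi_speaker_filter


# Stricter threshold and a shorter release window (illustrative values).
agent = Agent(
    multi_speaker_filter=FirstSpeakerWinsFilter(
        speech_threshold=0.6,
        silence_release_ms=1000.0,
    )
)

# No filter passed: the default FirstSpeakerWinsFilter() is used.
default_agent = Agent()
```

With the real library, only the final two statements should be needed, with Agent and FirstSpeakerWinsFilter imported from wherever the package exposes them.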
Omitting multi_speaker_filter (or passing None) defaults to FirstSpeakerWinsFilter() with the default parameters listed below.
FirstSpeakerWinsFilter Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| speech_threshold | float | 0.5 | Silero VAD score (0.0–1.0) a participant must exceed to acquire the lock |
| silence_release_ms | float | 1500.0 | Milliseconds of silence from the active speaker before releasing the lock |

The lock follows these rules:
- No lock held: all audio passes through. The first participant whose VAD score exceeds speech_threshold acquires the lock.
- Lock held: only the locked speaker’s audio reaches the pipeline. Other participants’ audio is dropped without running VAD (no extra cost).
- Silence timeout: if the active speaker goes silent for silence_release_ms, the lock is released.
- Turn end: a TurnEndedEvent releases the lock unconditionally.
- Participant disconnects: the lock is cleared immediately if that participant held it.
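
The lock rules can be sketched as a small state machine. This is a toy model under stated assumptions, not the library's implementation: the class name FirstSpeakerLock and its method names are hypothetical, VAD scores are passed in rather than computed by Silero, and timestamps are plain millisecond values. It also assumes that the frame on which the silence timeout fires still passes through.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstSpeakerLock:
    """Toy model of the documented lock rules (hypothetical names)."""
    speech_threshold: float = 0.5
    silence_release_ms: float = 1500.0
    holder: Optional[str] = None      # participant currently holding the lock
    last_speech_ms: float = 0.0       # last time the holder was above threshold

    def process(self, participant: str, vad_score: float, now_ms: float) -> bool:
        """Return True if this frame should reach the pipeline."""
        if self.holder is None:
            # No lock held: audio passes; speech above threshold acquires the lock.
            if vad_score > self.speech_threshold:
                self.holder = participant
                self.last_speech_ms = now_ms
            return True
        if participant != self.holder:
            # Lock held by someone else: drop the frame and ignore its VAD score.
            return False
        # Frame from the lock holder: refresh the timer or check for timeout.
        if vad_score > self.speech_threshold:
            self.last_speech_ms = now_ms
        elif now_ms - self.last_speech_ms >= self.silence_release_ms:
            self.holder = None  # silence timeout releases the lock
        return True

    def on_turn_ended(self) -> None:
        # Turn end releases the lock unconditionally.
        self.holder = None

    def on_participant_left(self, participant: str) -> None:
        # Only clear the lock if the departing participant held it.
        if self.holder == participant:
            self.holder = None


lock = FirstSpeakerLock()
lock.process("alice", 0.9, 0.0)      # True: alice acquires the lock
lock.process("bob", 0.9, 100.0)      # False: bob is dropped while alice holds it
lock.process("alice", 0.1, 1600.0)   # True: alice has been silent 1600 ms, lock released
lock.process("bob", 0.9, 1700.0)     # True: bob acquires the free lock
```

Keeping the drop path free of VAD work mirrors the "no extra cost" note above: only frames that could change the lock state need a score.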
Building a Custom AudioFilter
Replace the default filter with your own by implementing the AudioFilter interface.
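The AudioFilter method signatures are not shown on this page, so the sketch below declares a hypothetical stand-in interface with a single process method and implements it. Treat the method name, its arguments, and the return convention (None means "drop this frame") as assumptions, and check the real interface before adapting it.

```python
from typing import Optional, Protocol


class AudioFilter(Protocol):
    """Hypothetical stand-in; the real interface may use different methods."""

    def process(self, participant_id: str, pcm: bytes) -> Optional[bytes]:
        """Return the audio to forward, or None to drop the frame."""
        ...


class AllowListFilter:
    """Forwards audio only from allow-listed participants (for example a host)."""

    def __init__(self, allowed: set[str]) -> None:
        self.allowed = set(allowed)

    def process(self, participant_id: str, pcm: bytes) -> Optional[bytes]:
        # Pass the frame through unchanged for allowed speakers, drop all others.
        return pcm if participant_id in self.allowed else None


# Only the participant with id "host" is ever heard by the pipeline.
host_only: AudioFilter = AllowListFilter({"host"})
```

An instance of such a class would then be passed as multi_speaker_filter in place of the default filter.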
Best Practices
Tune thresholds for your environment: Lower speech_threshold for quiet speakers; raise it to reject background noise. Adjust silence_release_ms based on expected pause lengths in your use case.
Combine with turn detection: The multi-speaker filter gates which speaker’s audio reaches the pipeline. Turn detection determines when the speaker has finished. They work together automatically.
Next Steps
- Interruption Handling: Handle user interruptions
- Turn Detection: VAD and turn detection concepts

