How It Works
When several participants publish audio, the agent maintains a separate audio queue for each one. A multi-speaker filter decides whose audio actually reaches the pipeline:
- Each participant gets their own audio queue.
- A FirstSpeakerWinsFilter (enabled by default) uses Silero VAD to detect speech.
- The first participant whose VAD score exceeds speech_threshold acquires a lock; only their audio passes through.
- Everyone else’s audio is dropped until the lock is released.
- The lock releases when the active speaker goes silent for silence_release_ms, or when that participant disconnects.
Configuration
Pass a multi_speaker_filter to the Agent constructor to customize the behavior.
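A minimal sketch of the call shape, assuming local stand-ins for Agent and FirstSpeakerWinsFilter because the import paths and the remaining constructor arguments are not shown here. Only the parameter names and defaults are taken from the parameter table; everything else is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstSpeakerWinsFilter:
    """Local stand-in (NOT the library class); fields mirror the documented parameters."""

    speech_threshold: float = 0.5
    silence_release_ms: float = 1500.0
    model_dir: str = "/tmp/first_speaker_wins_model"


class Agent:
    """Local stand-in for the real Agent; all other constructor arguments are omitted."""

    def __init__(self, multi_speaker_filter: Optional[FirstSpeakerWinsFilter] = None) -> None:
        # Documented behavior: omitting the argument or passing None
        # falls back to a default FirstSpeakerWinsFilter.
        if multi_speaker_filter is None:
            multi_speaker_filter = FirstSpeakerWinsFilter()
        self.multi_speaker_filter = multi_speaker_filter


# Stricter speech detection and a shorter hold than the defaults.
agent = Agent(
    multi_speaker_filter=FirstSpeakerWinsFilter(
        speech_threshold=0.6,
        silence_release_ms=1000.0,
    )
)

# No filter argument: the default filter is used.
default_agent = Agent()
```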
Omitting multi_speaker_filter (or passing None) defaults to FirstSpeakerWinsFilter() with the default parameter values listed below.
FirstSpeakerWinsFilter Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| speech_threshold | float | 0.5 | Silero VAD score (0.0–1.0) a participant must exceed to acquire the lock |
| silence_release_ms | float | 1500.0 | Milliseconds of silence from the active speaker before releasing the lock |
| model_dir | str | "/tmp/first_speaker_wins_model" | Directory for Silero VAD model files |
The lock follows this lifecycle:
- No lock held: all audio passes through. The first participant whose VAD score exceeds speech_threshold acquires the lock.
- Lock held: only the locked speaker’s audio reaches the pipeline. Other participants’ audio is dropped without running VAD (no extra cost).
- Silence timeout: if the active speaker goes silent for silence_release_ms, the lock is released.
- Participant disconnects: the lock is cleared immediately if that participant held it.
Use the active_speaker_id property on the filter to inspect which participant currently holds the lock.
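The lifecycle above can be sketched as a small state machine. This is an illustrative model, not the library's implementation: the class name FirstSpeakerLock, its method names, and the caller-supplied timestamps are assumptions, and Silero VAD is replaced by a plain function that returns a score.

```python
from typing import Callable, Optional


class FirstSpeakerLock:
    """Illustrative model of the lock lifecycle (hypothetical, not the library class)."""

    def __init__(
        self,
        vad: Callable[[bytes], float],
        speech_threshold: float = 0.5,
        silence_release_ms: float = 1500.0,
    ) -> None:
        self._vad = vad  # stand-in for Silero VAD: returns a score in 0.0-1.0
        self.speech_threshold = speech_threshold
        self.silence_release_ms = silence_release_ms
        self.active_speaker_id: Optional[str] = None
        self._last_speech_ms = 0.0

    def process(self, participant_id: str, frame: bytes, now_ms: float) -> Optional[bytes]:
        """Return the frame if it should reach the pipeline, else None."""
        if self.active_speaker_id is None:
            # No lock held: audio passes through; speech acquires the lock.
            if self._vad(frame) > self.speech_threshold:
                self.active_speaker_id = participant_id
                self._last_speech_ms = now_ms
            return frame
        if participant_id != self.active_speaker_id:
            # Lock held by someone else: drop without running VAD.
            return None
        if self._vad(frame) > self.speech_threshold:
            self._last_speech_ms = now_ms
        elif now_ms - self._last_speech_ms >= self.silence_release_ms:
            # Silence timeout: release the lock.
            self.active_speaker_id = None
        return frame

    def on_disconnect(self, participant_id: str) -> None:
        """Clear the lock immediately if the departing participant held it."""
        if participant_id == self.active_speaker_id:
            self.active_speaker_id = None


# Frames are single bytes here; byte value / 255 plays the role of the VAD score.
lock = FirstSpeakerLock(vad=lambda frame: frame[0] / 255)
speech, silence = bytes([230]), bytes([10])

lock.process("alice", speech, now_ms=0)            # alice acquires the lock
dropped = lock.process("bob", speech, now_ms=500)  # bob is gated out while alice holds it
lock.process("alice", silence, now_ms=1500)        # 1500 ms of silence releases the lock
```

Passing timestamps in explicitly keeps the sketch deterministic; a real filter would more likely derive elapsed silence from frame durations or a clock.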
Realtime Mode
The same filter path applies before llm.process_audio() in realtime mode. The lock is released only by silence timeout or disconnect, because realtime mode has no STT turn signals.
Disabling the Filter
Passing None still defaults to FirstSpeakerWinsFilter. To disable filtering, pass a pass-through implementation.
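A minimal sketch of a pass-through filter. The AudioFilter method name and signature are not given here, so the process(participant_id, frame) shape below is an assumption chosen for illustration; adapt it to the actual interface.

```python
from typing import Optional, Protocol


class AudioFilter(Protocol):
    """Hypothetical shape of the filter interface (assumed for this sketch)."""

    def process(self, participant_id: str, frame: bytes) -> Optional[bytes]:
        """Return the frame to forward it, or None to drop it."""
        ...


class PassThroughFilter:
    """Forwards every participant's audio, which effectively disables filtering."""

    def process(self, participant_id: str, frame: bytes) -> Optional[bytes]:
        return frame


passthrough: AudioFilter = PassThroughFilter()
```

An instance like this would then be passed as multi_speaker_filter in place of None.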
Building a Custom AudioFilter
Replace the default filter with your own by implementing the AudioFilter interface.
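As one possible custom policy, the sketch below only forwards audio from an explicit allow list of participants. It assumes a hypothetical interface with a single process(participant_id, frame) method that returns the frame or None; the real AudioFilter method names may differ, and AllowListFilter is an invented example rather than a library class.

```python
from typing import Iterable, Optional, Protocol, Set


class AudioFilter(Protocol):
    """Hypothetical interface; the real method names are an open assumption."""

    def process(self, participant_id: str, frame: bytes) -> Optional[bytes]: ...


class AllowListFilter:
    """Custom policy: only audio from explicitly allowed participants gets through."""

    def __init__(self, allowed: Iterable[str]) -> None:
        self._allowed: Set[str] = set(allowed)

    def allow(self, participant_id: str) -> None:
        """Start forwarding audio from this participant."""
        self._allowed.add(participant_id)

    def revoke(self, participant_id: str) -> None:
        """Stop forwarding audio from this participant."""
        self._allowed.discard(participant_id)

    def process(self, participant_id: str, frame: bytes) -> Optional[bytes]:
        # Forward frames from allowed participants; drop everything else.
        return frame if participant_id in self._allowed else None


host_only = AllowListFilter(allowed=["host"])
```

Unlike the first-speaker-wins policy, this filter needs no VAD at all, since the decision depends only on who is speaking rather than on whether they are speaking.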
Best Practices
Tune thresholds for your environment: Lower speech_threshold for quiet speakers; raise it to reject background noise. Adjust silence_release_ms based on expected pause lengths in your use case.
Combine with turn detection: The multi-speaker filter gates which speaker’s audio reaches the pipeline. Turn detection determines when the speaker has finished. The two operate independently, so after turn detection fires, the lock may persist until the silence timeout elapses.
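A worked timeline makes the gap between turn end and lock release concrete. The helper below is a hypothetical illustration, not a library function, and it assumes the lock releases exactly silence_release_ms after the active speaker's last detected speech; the timestamps are made up.

```python
def lock_release_time_ms(last_speech_ms: float, silence_release_ms: float = 1500.0) -> float:
    """Earliest time the lock can release, assuming no further speech from the holder."""
    return last_speech_ms + silence_release_ms


last_speech_ms = 10_000.0  # active speaker's final speech frame
turn_end_ms = 10_400.0     # illustrative moment at which turn detection fires
release_ms = lock_release_time_ms(last_speech_ms)

# Between turn end and lock release, other participants are still gated out.
still_locked_window_ms = release_ms - turn_end_ms
```

With the default 1500 ms timeout, the lock in this example outlives the detected turn end by 1100 ms, so a second participant who starts talking at 11,000 ms would still be dropped.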
Next Steps
- Interruption Handling: Handle user interruptions
- Turn Detection: VAD and turn detection concepts