Handle calls where multiple human participants talk to the same agent. The framework routes audio per-participant and gates who the agent listens to at any given moment.

How It Works

When several participants publish audio, the agent maintains a separate audio queue for each one. A multi-speaker filter decides whose audio actually reaches the pipeline:
  1. Each participant gets their own audio queue.
  2. A FirstSpeakerWinsFilter (enabled by default) uses Silero VAD to detect speech.
  3. The first participant whose VAD score exceeds speech_threshold acquires a lock — only their audio passes through.
  4. Everyone else’s audio is dropped until the lock is released.
  5. The lock releases when the active speaker goes silent for silence_release_ms, or when that participant disconnects.
Participant A audio ─┐
                     ├──→ Multi-speaker filter ──→ STT → LLM → TTS
Participant B audio ─┘     (first speaker wins)
The filter only activates when two or more participants have active audio tracks. Single-speaker calls bypass it entirely with no overhead.
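The gating steps above can be sketched in plain Python. This is an illustrative model only, not the library's implementation: the `FirstSpeakerGate` class, its `process` method, and the precomputed VAD scores and timestamps are all assumptions made for the example, and the single-speaker bypass is not modeled.

```python
from typing import Optional


class FirstSpeakerGate:
    """Hypothetical stand-in that mimics first-speaker-wins gating."""

    def __init__(self, speech_threshold: float = 0.5,
                 silence_release_ms: float = 1500.0) -> None:
        self.speech_threshold = speech_threshold
        self.silence_release_ms = silence_release_ms
        self.active_speaker_id: Optional[str] = None
        self._last_speech_ms = 0.0

    def process(self, speaker_id: str, vad_score: float, now_ms: float) -> bool:
        """Return True if this frame should reach the pipeline."""
        is_speech = vad_score > self.speech_threshold

        # Release the lock once its holder has been silent long enough.
        if (self.active_speaker_id is not None
                and now_ms - self._last_speech_ms >= self.silence_release_ms):
            self.active_speaker_id = None

        if self.active_speaker_id is None:
            if is_speech:
                # First participant over the threshold acquires the lock.
                self.active_speaker_id = speaker_id
                self._last_speech_ms = now_ms
            return True  # no lock held: audio passes through

        if speaker_id != self.active_speaker_id:
            return False  # someone else holds the lock: drop the frame

        if is_speech:
            self._last_speech_ms = now_ms
        return True

    def clear(self, speaker_id: Optional[str] = None) -> None:
        """Drop the lock on disconnect (or unconditionally if no ID is given)."""
        if speaker_id is None or speaker_id == self.active_speaker_id:
            self.active_speaker_id = None


# (speaker, simulated VAD score, timestamp in ms)
frames = [
    ("alice", 0.9, 0),     # alice speaks first and takes the lock
    ("bob", 0.8, 100),     # dropped: alice holds the lock
    ("alice", 0.1, 500),   # alice pauses, lock still held
    ("bob", 0.8, 1000),    # dropped: only 1000 ms since alice last spoke
    ("bob", 0.8, 1600),    # alice silent for 1600 ms: lock released, bob acquires
    ("alice", 0.9, 1700),  # dropped: bob now holds the lock
]
gate = FirstSpeakerGate()
results = [gate.process(speaker, score, t) for speaker, score, t in frames]
```

In this trace, `results` marks which frames would be forwarded, and `gate.active_speaker_id` ends up pointing at the second speaker once the first has been silent past the release window.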

Configuration

Pass a multi_speaker_filter to the Agent constructor to customize the behavior:
from vision_agents.core import Agent, User
from vision_agents.core.utils.audio_filter import FirstSpeakerWinsFilter
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    llm=gemini.LLM("gemini-flash-lite-latest"),
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(),
    multi_speaker_filter=FirstSpeakerWinsFilter(
        speech_threshold=0.5,
        silence_release_ms=1500.0,
    ),
)
Omitting multi_speaker_filter (or passing None) defaults to FirstSpeakerWinsFilter() with the parameters shown above.

FirstSpeakerWinsFilter Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| speech_threshold | float | 0.5 | Silero VAD score (0.0–1.0) a participant must exceed to acquire the lock |
| silence_release_ms | float | 1500.0 | Milliseconds of silence from the active speaker before releasing the lock |
| model_dir | str | "/tmp/first_speaker_wins_model" | Directory for Silero VAD model files |
Lock lifecycle:
  1. No lock held — all audio passes through. The first participant whose VAD score exceeds speech_threshold acquires the lock.
  2. Lock held — only the locked speaker’s audio reaches the pipeline. Other participants’ audio is dropped without running VAD (no extra cost).
  3. Silence timeout — if the active speaker goes silent for silence_release_ms, the lock is released.
  4. Participant disconnects — the lock is cleared immediately if that participant held it.
Use the active_speaker_id property on the filter to inspect which participant currently holds the lock.

Realtime Mode

In realtime mode, the same filter runs on incoming audio before it reaches llm.process_audio(). The lock is released only by the silence timeout or by a participant disconnect, because realtime mode has no STT turn signals.

Disabling the Filter

Passing None still defaults to FirstSpeakerWinsFilter. To disable filtering, pass a pass-through implementation:
from typing import Optional

from getstream.video.rtc import PcmData

from vision_agents.core import Agent
from vision_agents.core.edge.types import Participant
from vision_agents.core.utils.audio_filter import AudioFilter


class PassThroughFilter(AudioFilter):
    async def process_audio(
        self, pcm: PcmData, participant: Participant
    ) -> Optional[PcmData]:
        return pcm

    def clear(self, participant: Optional[Participant] = None) -> None:
        pass  # Stateless filter: nothing to clear on disconnect.


agent = Agent(..., multi_speaker_filter=PassThroughFilter())

Building a Custom AudioFilter

Replace the default filter with your own by implementing the AudioFilter interface:
from typing import Optional

from getstream.video.rtc import PcmData

from vision_agents.core.edge.types import Participant
from vision_agents.core.utils.audio_filter import AudioFilter


class MyCustomFilter(AudioFilter):
    async def process_audio(
        self, pcm: PcmData, participant: Participant
    ) -> Optional[PcmData]:
        """Return PcmData to pass the audio through, or None to drop it."""
        # Your logic here
        return pcm

    def clear(self, participant: Optional[Participant] = None) -> None:
        """Called on participant disconnect.

        If participant is provided, only clear state for that participant.
        If None, clear all state unconditionally.
        """
        pass
Then pass it to the agent:
agent = Agent(
    ...,
    multi_speaker_filter=MyCustomFilter(),
)
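As a more concrete illustration of the interface, here is a minimal sketch of an allowlist filter that forwards audio only from approved users and keeps a per-participant drop counter for `clear` to reset. To stay runnable without the SDK it uses stand-in types: `FakeParticipant`, `FakePcm`, the `user_id` field, and `AllowlistFilter` itself are all hypothetical. In a real agent the class would subclass AudioFilter and receive the SDK's PcmData and Participant objects, whose attribute names may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class FakeParticipant:
    """Stand-in for Participant; the user_id field is an assumption."""
    user_id: str


@dataclass
class FakePcm:
    """Stand-in for PcmData."""
    samples: bytes


class AllowlistFilter:
    """Passes audio only from an approved set of user IDs."""

    def __init__(self, allowed_user_ids: Set[str]) -> None:
        self._allowed = set(allowed_user_ids)
        self.dropped_frames: Dict[str, int] = {}

    async def process_audio(
        self, pcm: FakePcm, participant: FakeParticipant
    ) -> Optional[FakePcm]:
        if participant.user_id in self._allowed:
            return pcm  # forward the frame unchanged
        # Count and drop frames from everyone else.
        count = self.dropped_frames.get(participant.user_id, 0)
        self.dropped_frames[participant.user_id] = count + 1
        return None

    def clear(self, participant: Optional[FakeParticipant] = None) -> None:
        if participant is None:
            self.dropped_frames.clear()  # reset all state
        else:
            self.dropped_frames.pop(participant.user_id, None)


host, guest = FakeParticipant("host"), FakeParticipant("guest")
frame = FakePcm(b"\x00\x01")
flt = AllowlistFilter({"host"})


async def demo():
    return [
        await flt.process_audio(frame, host),
        await flt.process_audio(frame, guest),
        await flt.process_audio(frame, guest),
    ]


outputs = asyncio.run(demo())
```

Here the host's frame comes back unchanged while both guest frames return None, and calling `flt.clear(guest)` discards the guest's counter, mirroring the per-participant contract described in the `clear` docstring above.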

Best Practices

Tune thresholds for your environment: Lower speech_threshold for quiet speakers; raise it to reject background noise. Adjust silence_release_ms based on expected pause lengths in your use case.
Combine with turn detection: The multi-speaker filter gates which speaker's audio reaches the pipeline. Turn detection determines when the speaker has finished. They operate independently: after turn detection fires, the lock may persist until the silence timeout elapses.
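One possible way to pick silence_release_ms from observed pause lengths is sketched below. This is a heuristic assumed for illustration, not something the framework provides: the `suggest_silence_release_ms` helper, the nearest-rank percentile, the 200 ms margin, and the sample pause data are all made up for the example.

```python
import math
from typing import Sequence


def suggest_silence_release_ms(
    pauses_ms: Sequence[float],
    percentile: float = 0.95,
    margin_ms: float = 200.0,
) -> float:
    """Suggest a release timeout that outlasts most mid-sentence pauses.

    Uses the nearest-rank percentile of the observed pauses plus a margin.
    """
    if not pauses_ms:
        raise ValueError("need at least one observed pause")
    ordered = sorted(pauses_ms)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] + margin_ms


# Hypothetical mid-sentence pause lengths (ms) measured from past calls.
observed = [300, 450, 500, 700, 900, 1200, 400, 650, 800, 1000]
suggested = suggest_silence_release_ms(observed)
```

A larger value keeps the lock through longer pauses at the cost of making other participants wait longer before they can take the floor, so the result is a starting point to adjust rather than a final setting.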

Next Steps

Interruption Handling: Handle user interruptions
Turn Detection: VAD and turn detection concepts