Handle calls where multiple human participants talk to the same agent. The framework routes audio per-participant and gates who the agent listens to at any given moment.

How It Works

When several participants join a call, the agent maintains a separate audio queue for each one. A multi-speaker filter decides whose audio actually reaches the pipeline:
  1. Each participant gets their own audio queue.
  2. A FirstSpeakerWinsFilter (enabled by default) uses Silero VAD to detect speech.
  3. The first participant whose speech exceeds the VAD threshold acquires a lock; only their audio passes through.
  4. Everyone else's audio is dropped until the lock is released.
  5. The lock releases when the active speaker's turn ends or they go silent.
Participant A audio ─┐
                     ├──→ Multi-speaker filter ──→ STT → LLM → TTS
Participant B audio ─┘     (first speaker wins)
The filter only activates when two or more participants are on the call. Single-speaker calls bypass it entirely with no overhead.
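As a rough sketch of that flow (not the framework's actual internals), a per-participant routing loop might look like the snippet below; every name in it is invented for illustration:
import asyncio
from typing import Optional


# Conceptual sketch of the routing described above, not the framework's internals.
async def route_participant_audio(
    participant_id: str,
    audio_queue: asyncio.Queue,     # this participant's dedicated audio queue
    speaker_filter,                 # anything with an async process_audio(frame, participant_id)
    pipeline_queue: asyncio.Queue,  # shared queue feeding STT -> LLM -> TTS
) -> None:
    while True:
        frame = await audio_queue.get()
        gated: Optional[bytes] = await speaker_filter.process_audio(frame, participant_id)
        if gated is not None:
            # Only frames the multi-speaker filter lets through reach the pipeline.
            await pipeline_queue.put(gated)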

Configuration

Pass a multi_speaker_filter to the Agent constructor to customize the behavior:
from vision_agents.core import Agent, User
from vision_agents.core.utils.audio_filter import FirstSpeakerWinsFilter
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    llm=gemini.LLM("gemini-2.5-flash-lite"),
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(),
    multi_speaker_filter=FirstSpeakerWinsFilter(
        speech_threshold=0.5,
        silence_release_ms=1500.0,
    ),
)
Omitting multi_speaker_filter (or passing None) defaults to FirstSpeakerWinsFilter() with the parameters shown above.
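In other words, the following two constructions behave identically (a minimal illustration of the defaults listed in the table below):
from vision_agents.core.utils.audio_filter import FirstSpeakerWinsFilter

# Explicit construction with the documented defaults...
explicit = FirstSpeakerWinsFilter(speech_threshold=0.5, silence_release_ms=1500.0)
# ...is what the agent falls back to when multi_speaker_filter is omitted or None.
implicit = FirstSpeakerWinsFilter()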

FirstSpeakerWinsFilter Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| speech_threshold | float | 0.5 | Silero VAD score (0.0–1.0) a participant must exceed to acquire the lock |
| silence_release_ms | float | 1500.0 | Milliseconds of silence from the active speaker before releasing the lock |
Lock lifecycle (sketched in code after this list):
  1. No lock held — all audio passes through. The first participant whose VAD score exceeds speech_threshold acquires the lock.
  2. Lock held — only the locked speaker’s audio reaches the pipeline. Other participants’ audio is dropped without running VAD (no extra cost).
  3. Silence timeout — if the active speaker goes silent for silence_release_ms, the lock is released.
  4. Turn end — a TurnEndedEvent releases the lock unconditionally.
  5. Participant disconnects — the lock is cleared immediately if that participant held it.
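Taken together, the lifecycle can be sketched as a simplified filter written against the AudioFilter interface introduced in the next section. The vad_score callable and the participant.user_id attribute are assumptions made for this sketch, not documented framework APIs:
import time
from typing import Callable, Optional

from getstream.video.rtc import PcmData
from vision_agents.core.edge.types import Participant
from vision_agents.core.utils.audio_filter import AudioFilter


class SketchFirstSpeakerWins(AudioFilter):
    """Simplified re-implementation of the lifecycle above, for illustration only."""

    def __init__(
        self,
        vad_score: Callable[[PcmData], float],  # stand-in for the Silero VAD call
        speech_threshold: float = 0.5,
        silence_release_ms: float = 1500.0,
    ):
        self._vad_score = vad_score
        self.speech_threshold = speech_threshold
        self.silence_release_ms = silence_release_ms
        self._locked_id: Optional[str] = None
        self._last_speech_at = 0.0

    async def process_audio(self, pcm: PcmData, participant: Participant) -> Optional[PcmData]:
        # participant.user_id is assumed for illustration; check the real Participant fields.
        now = time.monotonic()
        if self._locked_id is None:
            # No lock held: audio passes; the first speaker above the threshold takes the lock.
            if self._vad_score(pcm) > self.speech_threshold:
                self._locked_id = participant.user_id
                self._last_speech_at = now
            return pcm
        if participant.user_id != self._locked_id:
            # Lock held by someone else: drop without running VAD.
            return None
        # Locked speaker: keep tracking speech, release after silence_release_ms of silence.
        if self._vad_score(pcm) > self.speech_threshold:
            self._last_speech_at = now
        elif (now - self._last_speech_at) * 1000.0 >= self.silence_release_ms:
            self._locked_id = None  # silence timeout releases the lock
        return pcm

    def clear(self, participant: Optional[Participant] = None) -> None:
        # Turn end (participant is None) or a disconnect by the locked speaker clears the lock.
        if participant is None or participant.user_id == self._locked_id:
            self._locked_id = None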

Building a Custom AudioFilter

Replace the default filter with your own by implementing the AudioFilter interface:
from typing import Optional

from getstream.video.rtc import PcmData

from vision_agents.core.edge.types import Participant
from vision_agents.core.utils.audio_filter import AudioFilter


class MyCustomFilter(AudioFilter):
    async def process_audio(
        self, pcm: PcmData, participant: Participant
    ) -> Optional[PcmData]:
        """Return PcmData to pass the audio through, or None to drop it."""
        # Your logic here
        return pcm

    def clear(self, participant: Optional[Participant] = None) -> None:
        """Called on turn end or participant disconnect.

        If participant is provided, only clear state for that participant.
        If None, clear all state unconditionally.
        """
        pass
Then pass it to the agent:
agent = Agent(
    ...,
    multi_speaker_filter=MyCustomFilter(),
)
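As a concrete, purely illustrative example, a filter that only ever listens to a designated host could look like the class below; participant.user_id is an assumption about the Participant type rather than a documented field:
from typing import Optional

from getstream.video.rtc import PcmData
from vision_agents.core.edge.types import Participant
from vision_agents.core.utils.audio_filter import AudioFilter


class HostOnlyFilter(AudioFilter):
    """Illustrative filter: only a designated host's audio ever reaches the agent."""

    def __init__(self, host_user_id: str):
        self.host_user_id = host_user_id

    async def process_audio(self, pcm: PcmData, participant: Participant) -> Optional[PcmData]:
        # Pass the host's audio through; drop everyone else's.
        return pcm if participant.user_id == self.host_user_id else None

    def clear(self, participant: Optional[Participant] = None) -> None:
        # Stateless filter: nothing to reset on turn end or disconnect.
        pass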

Best Practices

- Tune thresholds for your environment: lower speech_threshold for quiet speakers, and raise it to reject background noise. Adjust silence_release_ms to match the pause lengths you expect in your use case (see the example after this list).
- Combine with turn detection: the multi-speaker filter gates which speaker's audio reaches the pipeline, while turn detection determines when that speaker has finished. The two work together automatically.
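For instance, a noisy conference room might call for a higher threshold and a shorter release window; the values below are illustrative starting points, not recommendations:
from vision_agents.core.utils.audio_filter import FirstSpeakerWinsFilter

# Illustrative tuning for a noisy environment.
noisy_room_filter = FirstSpeakerWinsFilter(
    speech_threshold=0.7,       # demand stronger speech evidence before granting the lock
    silence_release_ms=1000.0,  # hand the floor over after one second of silence
)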

Next Steps