Smart Turn is a turn detection system powered by FAL AI that intelligently detects when a speaker has completed their turn in a conversation. It analyzes audio in real time to determine whether speech is incomplete or complete, enabling natural conversation flow in voice agents. With the Vision Agents SDK you can use Smart Turn to manage conversational turns in your video calls with just a few lines of code.

Installation

Install the Smart Turn plugin with:
uv add vision-agents[smart_turn]

Example

Check out our simple agent example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.
from vision_agents.core import Agent
from vision_agents.plugins import smart_turn

# Create turn detection with custom settings
turn_detection = smart_turn.TurnDetection(
    buffer_duration=2.0,
    confidence_threshold=0.5
)

# Use with an agent
agent = Agent(
    turn_detection=turn_detection,
    # ... other agent configuration
)

# Or use standalone
turn_detection.start()

# Listen for turn events
@turn_detection.events.on("turn_started")
async def on_turn_started(event):
    print(f"User {event.speaker_id} started speaking")

@turn_detection.events.on("turn_ended")
async def on_turn_ended(event):
    print(f"User {event.speaker_id} finished speaking")

# Process audio from your call
await turn_detection.process_audio(pcm_data, user_id)

# Stop when finished
turn_detection.stop()

Initialisation

First, make sure you’ve created an API key for the FAL.ai service and set the FAL_KEY environment variable to your API key. The Smart Turn plugin is exposed via the TurnDetection class:
from vision_agents.plugins import smart_turn

# Default settings
turn_detection = smart_turn.TurnDetection()

# Custom settings for more aggressive turn detection
turn_detection = smart_turn.TurnDetection(
    buffer_duration=1.5,
    confidence_threshold=0.7
)

# Start detection
turn_detection.start()
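As noted above, TurnDetection() reads its credentials from the FAL_KEY environment variable when no api_key is passed. One way to provide it is to export the variable in your shell before launching the agent (the value below is a placeholder, not a real key):

```shell
# Make the FAL API key available to the process; TurnDetection()
# picks it up automatically when api_key is not passed explicitly.
export FAL_KEY="your-fal-api-key"
```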

Parameters

You can customise the behaviour of Smart Turn through the following parameters:
Name                  Type         Default  Description
api_key               str | None   None     FAL API key. If None, uses the FAL_KEY environment variable.
buffer_duration       float        2.0      Duration in seconds to buffer audio before processing.
confidence_threshold  float        0.5      Probability threshold (0.0–1.0) for determining turn completion.
sample_rate           int          16000    Audio sample rate in Hz for processing (audio is resampled automatically).
channels              int          1        Number of audio channels (mono).
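As a sanity check on these defaults, the amount of audio accumulated before each prediction can be worked out directly. This is rough arithmetic assuming 16-bit PCM; the plugin's internal bookkeeping may differ:

```python
# How much audio the default settings buffer per prediction window.
buffer_duration = 2.0   # seconds (default)
sample_rate = 16000     # Hz (default)
channels = 1            # mono (default)
bytes_per_sample = 2    # assuming 16-bit PCM

samples_per_buffer = int(buffer_duration * sample_rate * channels)
bytes_per_buffer = samples_per_buffer * bytes_per_sample

print(samples_per_buffer)  # 32000 samples
print(bytes_per_buffer)    # 64000 bytes (~62.5 KiB) per window
```

Shortening buffer_duration lowers latency but gives the model less context per prediction.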

Functionality

Process Audio

After joining a Stream call, pass audio directly to Smart Turn for turn detection. The plugin automatically handles buffering, resampling and turn prediction:
from getstream.video import rtc
# PcmData is the PCM audio payload type delivered by the Stream SDK's
# audio event; see the getstream package for the exact import path.

async with rtc.join(call, bot_user_id) as connection:

    @connection.on("audio")
    async def _on_audio(pcm: PcmData, user):
        await turn_detection.process_audio(pcm, user.id)
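The resampling to 16 kHz is handled for you. Purely to make the data flow concrete, here is a deliberately naive sketch of what down-sampling 48 kHz call audio to the model's 16 kHz input rate could look like; the plugin's actual resampler is certainly more sophisticated than this:

```python
def naive_downsample(samples: list[int], in_rate: int = 48000, out_rate: int = 16000) -> list[int]:
    """Keep every n-th sample. Real resamplers apply a low-pass filter
    first to avoid aliasing; this only illustrates the rate change."""
    if in_rate % out_rate != 0:
        raise ValueError("this sketch only handles integer rate ratios")
    step = in_rate // out_rate
    return samples[::step]

one_second_at_48k = [0] * 48000
print(len(naive_downsample(one_second_at_48k)))  # 16000 samples, i.e. one second at 16 kHz
```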

Start and Stop

Control turn detection with the start() and stop() methods:
# Start turn detection
turn_detection.start()

# Check if detection is active
if turn_detection.is_detecting():
    print("Turn detection is active")

# Stop turn detection
turn_detection.stop()

Events

The plugin emits turn detection events through the Vision Agents event system:

Turn Started Event

Fired when a user begins speaking:
@turn_detection.events.on("turn_started")
async def on_turn_started(event):
    print(f"Turn started by {event.speaker_id}")
    print(f"Confidence: {event.confidence}")

Turn Ended Event

Fired when a user completes their turn (based on the model’s prediction and confidence threshold):
@turn_detection.events.on("turn_ended")
async def on_turn_ended(event):
    print(f"Turn ended by {event.speaker_id}")
    print(f"Confidence: {event.confidence}")
    # Access additional FAL model data
    if event.custom:
        print(f"Prediction: {event.custom.get('prediction')}")
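Conceptually, the end-of-turn decision reduces to comparing the model's completion probability against confidence_threshold. A minimal sketch of that rule (an assumption about the decision logic; the plugin may apply additional smoothing on top):

```python
def is_turn_complete(completion_probability: float, confidence_threshold: float = 0.5) -> bool:
    # A turn is treated as finished once the model's probability that
    # the utterance is complete reaches the configured threshold.
    return completion_probability >= confidence_threshold

print(is_turn_complete(0.82))                            # True with the default 0.5
print(is_turn_complete(0.82, confidence_threshold=0.9))  # False: stricter threshold
```

A higher threshold makes the agent wait for stronger evidence before replying, trading responsiveness for fewer interruptions.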

Event Properties

Both TurnStartedEvent and TurnEndedEvent include the following properties:
Property    Type          Description
speaker_id  str           ID of the user whose turn started or ended.
confidence  float | None  Confidence level of the turn detection (0.0–1.0).
duration    float | None  Duration of the turn in seconds (if available).
custom      dict | None   Additional data from the FAL model (prediction, etc.).
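Since confidence, duration and custom are all optional, handlers should tolerate None values. A small illustration using a stand-in event class (a hypothetical dataclass mirroring the documented fields, not the SDK's actual event type):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnEvent:
    # Stand-in mirroring the documented event properties; the real
    # event classes live in the Vision Agents SDK.
    speaker_id: str
    confidence: Optional[float] = None
    duration: Optional[float] = None
    custom: Optional[dict] = None

def describe(event: TurnEvent) -> str:
    # Build a summary, skipping any fields the model did not report.
    parts = [f"turn by {event.speaker_id}"]
    if event.confidence is not None:
        parts.append(f"confidence={event.confidence:.2f}")
    if event.duration is not None:
        parts.append(f"duration={event.duration:.1f}s")
    if event.custom:
        parts.append(f"prediction={event.custom.get('prediction')}")
    return ", ".join(parts)

print(describe(TurnEvent("user-42", confidence=0.91, custom={"prediction": 1})))
# turn by user-42, confidence=0.91, prediction=1
```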