Fast-Whisper is a high-performance implementation of OpenAI’s Whisper speech recognition model using CTranslate2. It provides significantly faster inference speeds while maintaining the same accuracy as the original Whisper model.
The Fast-Whisper plugin for Vision Agents enables real-time audio transcription with support for multiple model sizes, automatic language detection, and both CPU and GPU acceleration.
Installation
Install the Fast-Whisper plugin with:

```bash
uv add "vision-agents[fast-whisper]"
```
Example
```python
from vision_agents.core import Agent, User
from vision_agents.plugins import fast_whisper, getstream, openai, elevenlabs

# Create agent with Fast-Whisper STT
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    stt=fast_whisper.STT(model_size="base"),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=elevenlabs.TTS(),
)
```
Initialization
The Fast-Whisper plugin is exposed through the STT class:
```python
from vision_agents.plugins import fast_whisper

# Default configuration (base model, auto language detection)
stt = fast_whisper.STT()

# Custom configuration
stt = fast_whisper.STT(
    model_size="medium",
    language="en",
    device="cuda",
    compute_type="float16",
)
```
Parameters
These are the parameters available in the Fast-Whisper STT plugin:
| Name | Type | Default | Description |
|---|---|---|---|
| model_size | str | "base" | Whisper model size to use. Options: "tiny", "base", "small", "medium", "large", "large-v2", "large-v3". |
| language | str or None | None | Language code for transcription (e.g., "en", "es", "fr"). If None, language is automatically detected. |
| device | str | "cpu" | Device to run inference on. Options: "cpu", "cuda", "auto". |
| compute_type | str | "int8" | Computation precision. Options: "int8", "float16", "float32". Lower precision = faster inference, less memory. |
| sample_rate | int | 16000 | Audio sample rate in Hz. Audio is automatically resampled to this rate. |
Features
Fast Inference
Fast-Whisper uses CTranslate2 for optimized inference, providing:
- 2-4x faster transcription compared to the original Whisper implementation
- Lower memory usage through quantization support
- GPU acceleration when available
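Under the hood this is the faster-whisper library running on CTranslate2, so the same speedups can be seen by calling it directly. A minimal sketch of direct faster-whisper usage with an int8-quantized model ("sample.wav" is a placeholder file, not part of the plugin API):

```python
from faster_whisper import WhisperModel

# Load a quantized model on CPU (int8 keeps memory low and inference fast)
model = WhisperModel("base", device="cpu", compute_type="int8")

# Transcribe a local audio file; language=None triggers automatic detection
segments, info = model.transcribe("sample.wav", language=None)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```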
Model Sizes
Choose the right model size for your use case:
| Model | Parameters | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 39M | Fastest | Basic | Real-time, resource-constrained |
| base | 74M | Very fast | Good | General purpose, real-time |
| small | 244M | Fast | Better | Balanced speed and accuracy |
| medium | 769M | Moderate | High | Higher accuracy needs |
| large | 1550M | Slower | Best | Maximum accuracy |
| large-v2 | 1550M | Slower | Best | Improved multilingual support |
| large-v3 | 1550M | Slower | Best | Latest improvements |
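If the model size should vary per environment (for example, a small model locally and a larger one in production), one option is to pass it in from configuration. The WHISPER_MODEL_SIZE variable below is illustrative; the plugin does not read any environment variables itself:

```python
import os

from vision_agents.plugins import fast_whisper

# WHISPER_MODEL_SIZE is an illustrative variable name; the plugin does not
# read it on its own, so we pass the value in explicitly.
model_size = os.environ.get("WHISPER_MODEL_SIZE", "base")
stt = fast_whisper.STT(model_size=model_size)
```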
Automatic Language Detection
When language is not specified, Fast-Whisper automatically detects the spoken language:
```python
stt = fast_whisper.STT(model_size="base")  # Auto-detect language
```
For better performance with a known language, specify it explicitly:
```python
stt = fast_whisper.STT(model_size="base", language="en")  # English only
```
GPU Acceleration
Enable GPU acceleration for faster transcription:
```python
stt = fast_whisper.STT(
    model_size="medium",
    device="cuda",
    compute_type="float16",  # Use float16 for GPU
)
```
GPU acceleration requires CUDA to be installed. If CUDA is not available, the plugin will automatically fall back to CPU.
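To choose the device and precision explicitly instead of relying on the fallback, you can probe for a GPU at startup. A minimal sketch using CTranslate2's device count helper (CTranslate2 is the backend Fast-Whisper runs on; the model sizes chosen here are just examples):

```python
import ctranslate2

from vision_agents.plugins import fast_whisper

# Use GPU settings only when CTranslate2 can see at least one CUDA device.
if ctranslate2.get_cuda_device_count() > 0:
    stt = fast_whisper.STT(model_size="medium", device="cuda", compute_type="float16")
else:
    stt = fast_whisper.STT(model_size="base", device="cpu", compute_type="int8")
```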
Functionality
Process Audio
The process_audio() method processes incoming audio data and emits transcription events:
```python
from getstream.video import rtc

async with rtc.join(call, bot_user_id) as connection:

    @connection.on("audio")
    async def on_audio(pcm: PcmData, user):
        await stt.process_audio(pcm, user)
```
Events
The plugin emits standard Vision Agents STT events:
Transcript Event
Fired when a transcription is completed:
```python
from vision_agents.core.stt.events import TranscriptEvent

@stt.events.on(TranscriptEvent)
async def on_transcript(event: TranscriptEvent):
    print(f"User {event.user_id} said: {event.text}")
    print(f"Language: {event.language}")
    print(f"Confidence: {event.confidence}")
```
Error Event
Fired when an error occurs during transcription:
```python
from vision_agents.core.stt.events import STTErrorEvent

@stt.events.on(STTErrorEvent)
async def on_error(event: STTErrorEvent):
    print(f"Transcription error: {event.error_message}")
```
CPU Optimization
For CPU-only environments, use int8 quantization for best performance:
```python
stt = fast_whisper.STT(
    model_size="base",
    device="cpu",
    compute_type="int8",
)
```
GPU Optimization
For GPU environments, use float16 for optimal speed and accuracy:
```python
stt = fast_whisper.STT(
    model_size="medium",
    device="cuda",
    compute_type="float16",
)
```
Model Selection
- Real-time applications: use the tiny or base models
- Balanced use cases: use the small or medium models
- Maximum accuracy: use the large-v3 model (requires more resources); a sketch of encoding these guidelines follows below
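One way to encode these guidelines is a small mapping from deployment profile to model size. The PROFILES table and make_stt helper below are illustrative, not part of the plugin:

```python
from vision_agents.plugins import fast_whisper

# Illustrative mapping from deployment profile to Whisper model size.
PROFILES = {
    "realtime": "base",
    "balanced": "small",
    "accuracy": "large-v3",
}

def make_stt(profile: str = "realtime") -> fast_whisper.STT:
    return fast_whisper.STT(model_size=PROFILES[profile])
```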
Usage with Agent
Use Fast-Whisper as part of a complete voice agent:
```python
from vision_agents.core import Agent, User
from vision_agents.plugins import fast_whisper, getstream, openai, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Voice Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    stt=fast_whisper.STT(
        model_size="base",
        language="en",
        device="cpu",
        compute_type="int8",
    ),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=elevenlabs.TTS(),
)

# Join a call
call = client.video.call("default", call_id)
await call.get_or_create(data={"created_by_id": agent.agent_user.id})

with await agent.join(call):
    await agent.finish()
```
Model Downloads
On first run, Fast-Whisper downloads the selected model from Hugging Face. Models are cached locally to avoid repeated downloads:
- tiny: ~39 MB
- base: ~74 MB
- small: ~244 MB
- medium: ~769 MB
- large: ~1.5 GB
The first initialization may take a few seconds while the model is downloaded.
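To avoid paying the download cost on the first call, the cache can be warmed ahead of time, for example in a container build step. A minimal sketch that loads the model once through the underlying faster-whisper library, which should populate the same local cache the plugin reads from:

```python
from faster_whisper import WhisperModel

# Instantiating the model downloads it (if needed) and caches it locally;
# later loads are served from disk.
WhisperModel("base", device="cpu", compute_type="int8")
```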
Supported Languages
Fast-Whisper supports 99 languages, including:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Dutch (nl)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- And many more…
See the Whisper documentation for the complete list.
Links