Fast-Whisper is a high-performance implementation of OpenAI’s Whisper speech recognition model using CTranslate2. It provides significantly faster inference speeds while maintaining the same accuracy as the original Whisper model. The Fast-Whisper plugin for Vision Agents enables real-time audio transcription with support for multiple model sizes, automatic language detection, and both CPU and GPU acceleration.

Installation

Install the Fast-Whisper plugin with:
uv add vision-agents[fast-whisper]

Example

from vision_agents.plugins import fast_whisper
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, openai, elevenlabs

# Create agent with Fast-Whisper STT
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    stt=fast_whisper.STT(model_size="base"),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=elevenlabs.TTS()
)

Initialization

The Fast-Whisper plugin is exposed through the STT class:
from vision_agents.plugins import fast_whisper

# Default configuration (base model, auto language detection)
stt = fast_whisper.STT()

# Custom configuration
stt = fast_whisper.STT(
    model_size="medium",
    language="en",
    device="cuda",
    compute_type="float16"
)

Parameters

These are the parameters available in the Fast-Whisper STT plugin:
  • model_size (str, default "base"): Whisper model size to use. Options: "tiny", "base", "small", "medium", "large", "large-v2", "large-v3".
  • language (str or None, default None): Language code for transcription (e.g., "en", "es", "fr"). If None, the language is detected automatically.
  • device (str, default "cpu"): Device to run inference on. Options: "cpu", "cuda", "auto".
  • compute_type (str, default "int8"): Computation precision. Options: "int8", "float16", "float32". Lower precision means faster inference and less memory.
  • sample_rate (int, default 16000): Audio sample rate in Hz. Audio is automatically resampled to this rate.
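The sample_rate parameter implies a resampling step before transcription. A simplified linear-interpolation resampler illustrates the idea in pure Python (this is a sketch of the concept, not the plugin's internal implementation, which may use a higher-quality resampler):

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample PCM samples via linear interpolation (illustrative only)."""
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Weighted blend of the two nearest source samples
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

# A 48 kHz capture downsampled to Whisper's expected 16 kHz
chunk = [0.0] * 480  # 10 ms of audio at 48 kHz
resampled = resample_linear(chunk, 48000, 16000)
print(len(resampled))  # 160 samples = 10 ms at 16 kHz
```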

Features

Fast Inference

Fast-Whisper uses CTranslate2 for optimized inference, providing:
  • 2-4x faster transcription compared to the original Whisper implementation
  • Lower memory usage through quantization support
  • GPU acceleration when available
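The memory saving from quantization is simple arithmetic: weights stored at lower precision take proportionally less space. A rough back-of-the-envelope estimate (weights only, ignoring activations and runtime overhead):

```python
# Bytes per parameter for each supported compute type
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def approx_weight_memory_mb(params_millions: float, compute_type: str) -> float:
    """Approximate weight memory in MB for a given precision (weights only)."""
    return params_millions * BYTES_PER_PARAM[compute_type]

# The "base" model (~74M parameters) at different precisions
for ct in ("float32", "float16", "int8"):
    print(ct, approx_weight_memory_mb(74, ct), "MB")
# int8 needs roughly a quarter of the float32 footprint
```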

Model Sizes

Choose the right model size for your use case:
  • tiny (39M parameters): fastest, good accuracy. Real-time, resource-constrained.
  • base (74M parameters): very fast, good accuracy. General purpose, real-time.
  • small (244M parameters): fast, better accuracy. Balanced speed and accuracy.
  • medium (769M parameters): moderate speed, better accuracy. Higher accuracy needs.
  • large (1550M parameters): slower, best accuracy. Maximum accuracy.
  • large-v2 (1550M parameters): slower, best accuracy. Improved multilingual support.
  • large-v3 (1550M parameters): slower, best accuracy. Latest improvements.

Automatic Language Detection

When language is not specified, Fast-Whisper automatically detects the spoken language:
stt = fast_whisper.STT(model_size="base")  # Auto-detect language
For better performance with a known language, specify it explicitly:
stt = fast_whisper.STT(model_size="base", language="en")  # English only

GPU Acceleration

Enable GPU acceleration for faster transcription:
stt = fast_whisper.STT(
    model_size="medium",
    device="cuda",
    compute_type="float16"  # Use float16 for GPU
)
GPU acceleration requires CUDA to be installed. If CUDA is not available, the plugin will automatically fall back to CPU.
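The fallback behavior can be approximated with a small helper (a hedged sketch of the idea only; the plugin performs its own device detection internally, and probing CUDA via torch is just one common option):

```python
def resolve_device(requested: str) -> str:
    """Resolve a requested device to an available one, falling back to CPU."""
    cuda_available = False
    try:
        import torch  # assumption: torch is one way to probe for CUDA
        cuda_available = torch.cuda.is_available()
    except ImportError:
        pass
    if requested in ("cuda", "auto") and cuda_available:
        return "cuda"
    return "cpu"

print(resolve_device("auto"))  # "cuda" if available, otherwise "cpu"
```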

Functionality

Process Audio

The process_audio() method processes incoming audio data and emits transcription events:
from getstream.video import rtc  # PcmData is the PCM audio container from the GetStream SDK

async with rtc.join(call, bot_user_id) as connection:
    @connection.on("audio")
    async def on_audio(pcm: PcmData, user):
        await stt.process_audio(pcm, user)
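Streaming STT implementations typically buffer incoming PCM until enough audio has accumulated to transcribe a window. A minimal illustration of that pattern (not the plugin's internal code; the class and its sizes are hypothetical):

```python
class AudioBuffer:
    """Accumulate 16 kHz PCM samples and release fixed-size windows."""

    def __init__(self, sample_rate: int = 16000, window_seconds: float = 1.0):
        self.window = int(sample_rate * window_seconds)
        self.samples: list[int] = []

    def push(self, chunk: list[int]) -> list[list[int]]:
        """Add a chunk; return any complete windows ready for transcription."""
        self.samples.extend(chunk)
        windows = []
        while len(self.samples) >= self.window:
            windows.append(self.samples[:self.window])
            self.samples = self.samples[self.window:]
        return windows

buf = AudioBuffer(sample_rate=16000, window_seconds=1.0)
ready = buf.push([0] * 20000)        # 1.25 s of audio
print(len(ready), len(buf.samples))  # 1 window released, 4000 samples held back
```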

Events

The plugin emits standard Vision Agents STT events:

Transcript Event

Fired when a transcription is completed:
from vision_agents.core.stt.events import TranscriptEvent

@stt.events.on(TranscriptEvent)
async def on_transcript(event: TranscriptEvent):
    print(f"User {event.user_id} said: {event.text}")
    print(f"Language: {event.language}")
    print(f"Confidence: {event.confidence}")
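The decorator-based subscription shown above can be sketched with a tiny dispatcher (purely illustrative; the real event system lives in vision_agents.core and differs in detail):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    user_id: str
    text: str

class EventBus:
    """Minimal type-keyed event dispatcher mimicking stt.events.on(...)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_type):
        """Register a handler for an event type, decorator-style."""
        def register(fn):
            self._handlers[event_type].append(fn)
            return fn
        return register

    def emit(self, event):
        for fn in self._handlers[type(event)]:
            fn(event)

bus = EventBus()
received = []

@bus.on(TranscriptEvent)
def on_transcript(event: TranscriptEvent):
    received.append(event.text)

bus.emit(TranscriptEvent(user_id="u1", text="hello"))
print(received)  # ['hello']
```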

Error Event

Fired when an error occurs during transcription:
from vision_agents.core.stt.events import STTErrorEvent

@stt.events.on(STTErrorEvent)
async def on_error(event: STTErrorEvent):
    print(f"Transcription error: {event.error_message}")

Performance Optimization

CPU Optimization

For CPU-only environments, use int8 quantization for best performance:
stt = fast_whisper.STT(
    model_size="base",
    device="cpu",
    compute_type="int8"
)

GPU Optimization

For GPU environments, use float16 for optimal speed and accuracy:
stt = fast_whisper.STT(
    model_size="medium",
    device="cuda",
    compute_type="float16"
)

Model Selection

  • Real-time applications: Use tiny or base models
  • Balanced use cases: Use small or medium models
  • Maximum accuracy: Use large-v3 model (requires more resources)
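These recommendations can be encoded as a simple lookup if you want to choose a model at runtime (the mapping and function names here are illustrative, not part of the plugin API):

```python
# Illustrative mapping from use case to a reasonable model_size
MODEL_FOR_USE_CASE = {
    "realtime": "base",      # low latency, good accuracy
    "balanced": "small",
    "accuracy": "large-v3",  # best accuracy, most resources
}

def pick_model_size(use_case: str) -> str:
    """Return a model_size for a use case, defaulting to 'base'."""
    return MODEL_FOR_USE_CASE.get(use_case, "base")

print(pick_model_size("realtime"))  # base
```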

Usage with Agent

Use Fast-Whisper as part of a complete voice agent:
from vision_agents.core import Agent, User
from vision_agents.plugins import fast_whisper, getstream, openai, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Voice Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    stt=fast_whisper.STT(
        model_size="base",
        language="en",
        device="cpu",
        compute_type="int8"
    ),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=elevenlabs.TTS()
)

# Join a call (assumes an initialized GetStream client)
call = client.video.call("default", call_id)
await call.get_or_create(data={"created_by_id": agent.agent_user.id})

with await agent.join(call):
    await agent.finish()

Model Downloads

On first run, Fast-Whisper downloads the selected model from Hugging Face. Models are cached locally to avoid repeated downloads:
  • tiny: ~39 MB
  • base: ~74 MB
  • small: ~244 MB
  • medium: ~769 MB
  • large: ~1.5 GB
The first initialization may take a few seconds while the model is downloaded.

Supported Languages

Fast-Whisper supports 99 languages, including:
  • English (en)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • Dutch (nl)
  • Russian (ru)
  • Chinese (zh)
  • Japanese (ja)
  • Korean (ko)
  • And many more…
See the Whisper documentation for the complete list.