In Depth
Whisper is an automatic speech recognition (ASR) system released by OpenAI in September 2022. Trained on 680,000 hours of multilingual audio data collected from the web, it can transcribe speech in over 90 languages and translate speech in those languages into English. The scale and diversity of that training data make it robust to accents, background noise, and technical vocabulary.
Unlike many commercial speech recognition systems, Whisper is open-source: its code and model weights are published under the MIT license, so it can be run locally, which makes it attractive for privacy-sensitive applications. It comes in multiple sizes, from tiny (39M parameters) to large (about 1.55B parameters), allowing users to trade accuracy against speed and memory use.
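As a rough illustration of the size trade-off, the sketch below picks the largest model that fits a parameter budget and then runs it locally. It assumes the openai-whisper Python package and ffmpeg are installed; `pick_model` and `transcribe_file` are hypothetical helper names introduced here, and the parameter counts are the approximate figures from the package's README.

```python
# Approximate parameter counts in millions, as listed in the openai-whisper README.
MODEL_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def pick_model(max_params_m: float) -> str:
    """Return the largest model that fits the parameter budget (hypothetical helper)."""
    fitting = [name for name, size in MODEL_PARAMS_M.items() if size <= max_params_m]
    if not fitting:
        raise ValueError(f"no Whisper model has <= {max_params_m}M parameters")
    return max(fitting, key=MODEL_PARAMS_M.get)


def transcribe_file(path: str, max_params_m: float = 244, translate: bool = False) -> str:
    """Transcribe an audio file locally, or translate it into English."""
    # Imported lazily so the size helper works without the package installed.
    # Assumes `pip install openai-whisper` and an ffmpeg binary on the PATH.
    import whisper

    model = whisper.load_model(pick_model(max_params_m))
    task = "translate" if translate else "transcribe"
    result = model.transcribe(path, task=task)
    return result["text"]
```

Calling `transcribe_file("meeting.mp3")` (the file name is a placeholder) would load the small model and return the transcript as a string. Nothing leaves the machine, which is the property that matters for privacy-sensitive use.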
Whisper has become a foundational component in many AI pipelines, particularly for meeting transcription, podcast indexing, subtitle generation, and voice-based interfaces. Because it is open-source, it has spawned numerous optimized variants, such as Faster Whisper, which speeds up inference, and WhisperX, which adds features like word-level timestamps and speaker diarization.
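Subtitle generation is a simple downstream step to sketch. The openai-whisper package returns timed segments with "start", "end", and "text" fields, and the example below turns such a list into SubRip (SRT) text. The helper names `srt_timestamp` and `segments_to_srt` are hypothetical, and this is a minimal sketch rather than the output code of any particular tool.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render a list of {"start", "end", "text"} dicts as SRT subtitle text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Each cue ends with a newline; joining with "\n" leaves a blank line between cues.
    return "\n".join(blocks)
```

With the openai-whisper package, the input could come from `model.transcribe("talk.mp3")["segments"]`, where the file name is a placeholder. Variants that expose segment objects instead of dicts would need a small adapter.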