What It Is

Speech recognition, formally known as automatic speech recognition (ASR), is the technology that converts spoken language into written text. Modern ASR systems use deep learning models trained on hundreds of thousands of hours of audio to achieve near-human accuracy in many scenarios. The technology is foundational to voice assistants (Siri, Alexa, Google Assistant), real-time transcription, call center analytics, and accessibility tools.

The market for speech recognition exceeded $15 billion in 2025. OpenAI's Whisper model (open-source), Google's Universal Speech Model, and commercial offerings from Deepgram, AssemblyAI, and Rev.ai have made high-quality transcription accessible and affordable.

How It Works

Modern ASR systems use end-to-end neural network architectures that directly map audio signals to text:

Audio processing — raw audio is converted into a spectrogram — a visual representation of frequencies over time. Mel-frequency spectrograms are the standard input representation, capturing the acoustic features relevant to speech while discarding irrelevant information.
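The mel scale that gives these spectrograms their name compresses frequency the way human hearing does: roughly linear below ~1 kHz, logarithmic above. A minimal sketch of the standard (HTK-style) Hz-to-mel conversion, using only the standard library — the filter-bank construction itself is omitted:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to the mel scale (HTK formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(f_min: float, f_max: float, n_mels: int) -> list[float]:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(1, n_mels + 1)]

# A typical 80-band configuration for 16 kHz audio (Whisper uses 80 mel bins):
centers = mel_centers(0.0, 8000.0, 80)
```

Because the mel scale is compressive, the resulting bands are packed densely at low frequencies (where speech formants live) and sparsely at high frequencies.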

Encoder — a neural network (typically a transformer or conformer) processes the spectrogram and produces a sequence of hidden representations capturing the acoustic and linguistic content. The encoder learns to handle accent variation, background noise, and speaking speed.

Decoder — generates text from the encoder's representations. In connectionist temporal classification (CTC) models, the decoder directly predicts characters or word pieces at each time step. In attention-based models (like Whisper), the decoder autoregressively generates tokens while attending to the encoder output.
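The CTC decoding rule mentioned above can be shown concretely. Greedy CTC decoding picks the most likely symbol per time step, then collapses the frame sequence: merge consecutive repeats, then delete the blank symbol (here written `_`). A minimal sketch:

```python
def ctc_collapse(frames: list[str], blank: str = "_") -> str:
    """Apply the CTC collapse rule: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Frame-level predictions for "cat"; repeats and blanks collapse away.
print(ctc_collapse(["c", "c", "_", "a", "a", "_", "t", "t"]))  # cat

# A blank between repeats is how CTC represents genuine double letters:
print(ctc_collapse(["h", "e", "l", "_", "l", "o"]))  # hello
```

This is why CTC needs the blank token at all: without it, the collapse rule could never produce words like "hello" with adjacent identical letters.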

Language model integration — ASR systems often incorporate language models that improve transcription accuracy by predicting likely word sequences. A language model knows that "recognize speech" is more probable than "wreck a nice beach" — even though they sound similar.
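One common integration scheme is shallow fusion: rank candidate transcripts by the acoustic model's log-probability plus a weighted language-model log-probability. The sketch below uses a toy bigram table with made-up probabilities (all names and values are illustrative, not from any real system):

```python
import math

# Toy bigram LM: log-probabilities for word pairs (hypothetical values).
BIGRAM_LOGP = {
    ("recognize", "speech"): math.log(0.9),
    ("wreck", "a"): math.log(0.2),
    ("a", "nice"): math.log(0.3),
    ("nice", "beach"): math.log(0.3),
}

def lm_score(words: list[str], floor: float = math.log(1e-4)) -> float:
    """Sum bigram log-probs, backing off to a small floor for unseen pairs."""
    return sum(BIGRAM_LOGP.get(pair, floor) for pair in zip(words, words[1:]))

def rescore(hypotheses: list[tuple[list[str], float]], lm_weight: float = 0.5):
    """Shallow fusion: combined score = acoustic log-prob + weight * LM log-prob."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm_score(h[0]))

# Two acoustically similar hypotheses; the LM breaks the tie.
hyps = [(["recognize", "speech"], -4.0),
        (["wreck", "a", "nice", "beach"], -3.9)]
best, _ = rescore(hyps)  # picks "recognize speech"
```

The `lm_weight` hyperparameter controls how much the language model can override the acoustics; production systems tune it on held-out data.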

Key models:

  • Whisper (OpenAI) — trained on 680,000 hours of multilingual audio, Whisper handles 99 languages with strong accuracy. Its open-source release in 2022 democratized high-quality ASR.
  • Universal Speech Model (Google) — trained on 12 million hours of speech across 300+ languages, targeting low-resource languages.
  • Conformer — a hybrid architecture combining convolutional layers (for local audio patterns) with transformer layers (for global context), achieving state-of-the-art accuracy in many benchmarks.

Key Capabilities

Real-time transcription — streaming ASR processes audio in real time, producing text with sub-second latency. This powers live captions for video calls (Zoom, Teams, Google Meet), TV broadcasts, and classroom lectures. Real-time ASR requires models that produce results incrementally without waiting for the speaker to finish.
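The incremental behavior described above can be sketched as a generator loop: decoded words arrive chunk by chunk, and the running partial transcript is re-emitted after each chunk rather than once at the end. This is a toy illustration (real streaming decoders also revise earlier words as more context arrives):

```python
from typing import Iterator

def streaming_transcribe(decoded_chunks: Iterator[str]) -> Iterator[str]:
    """Toy streaming loop: each chunk's already-decoded words extend a running
    partial transcript, which is yielded immediately for live captioning."""
    partial: list[str] = []
    for chunk_words in decoded_chunks:
        partial.extend(chunk_words.split())
        yield " ".join(partial)  # caller updates the on-screen caption

captions = list(streaming_transcribe(iter(["good morning", "everyone and", "welcome"])))
# Each yield is a progressively longer partial caption.
```

The key property is that useful output appears after the first chunk, not after the speaker finishes — the source of streaming ASR's sub-second latency.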

Speaker diarization — identifying who is speaking when in multi-speaker audio. Diarization models separate overlapping speakers and label segments, essential for meeting transcription and call analytics. Combining diarization with ASR produces speaker-attributed transcripts.
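Combining the two systems is often done by timestamp alignment: assign each ASR word to the diarization segment whose time span contains it, then merge consecutive words from the same speaker. A minimal sketch with hypothetical timestamps:

```python
def attribute_words(words, segments):
    """Build a speaker-attributed transcript.
    words: list of (word, time_sec) from ASR with word-level timestamps.
    segments: list of (speaker, start_sec, end_sec) from diarization."""
    transcript = []
    for word, t in words:
        speaker = next((s for s, a, b in segments if a <= t < b), "unknown")
        if transcript and transcript[-1][0] == speaker:
            # Same speaker as the previous word: extend their turn.
            transcript[-1] = (speaker, transcript[-1][1] + " " + word)
        else:
            transcript.append((speaker, word))
    return transcript

words = [("hello", 0.2), ("there", 0.6), ("hi", 1.4), ("back", 1.8)]
segments = [("A", 0.0, 1.0), ("B", 1.0, 2.5)]
turns = attribute_words(words, segments)  # [("A", "hello there"), ("B", "hi back")]
```

Real pipelines must additionally handle words that straddle segment boundaries and overlapping speech, which this sketch ignores.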

Punctuation and formatting — raw ASR output lacks punctuation, capitalization, and paragraph breaks. Post-processing models add these elements using NLP techniques, producing readable transcripts.

Code-switching — handling speakers who switch between languages mid-sentence, common in multilingual communities. Modern multilingual models handle code-switching better than language-specific models.

Applications

Voice assistants — Siri, Alexa, and Google Assistant use ASR as their front-end, converting voice commands into text that is then passed to natural language understanding. Accuracy in noisy, far-field conditions (across a room) has improved dramatically with beamforming microphones and noise-robust models.

Contact centers — call center analytics platforms transcribe millions of customer calls and analyze them for sentiment, compliance, and quality. Companies like NICE, Verint, and Observe.AI use ASR to convert voice interactions into searchable, analyzable data. Real-time agent assist tools listen to calls and surface relevant information.

Healthcare — medical dictation systems (Nuance DAX, Suki) transcribe doctor-patient conversations into structured clinical notes, reducing documentation burden. Accuracy for medical terminology is critical — ASR must correctly distinguish "hypertension" from "hypotension."

Media and content — YouTube auto-generates captions for over 1 billion videos. Podcast transcription services (Descript, Otter.ai) make audio content searchable and editable. Media monitoring services transcribe broadcast content for analysis.

Accessibility — real-time captions enable deaf and hard-of-hearing individuals to participate in conversations, meetings, and events. Google's Live Transcribe and Apple's Live Captions bring ASR to everyday communication.

Legal and compliance — court reporting, deposition transcription, and regulatory surveillance all use ASR to convert spoken proceedings to text. Accuracy requirements in legal contexts demand specialized models with low word error rates.

Accuracy and Metrics

ASR quality is measured by word error rate (WER) — the number of word errors (substitutions + insertions + deletions) divided by the number of words in the reference transcript. Because insertions count as errors, WER can exceed 100%. State-of-the-art systems achieve:

  • 3-5% WER on clean English speech (near human parity)
  • 8-15% WER on conversational speech with multiple speakers
  • 15-30% WER on noisy, accented, or domain-specific audio
  • Wide variation across languages, with low-resource languages seeing much higher error rates

Human transcriptionists typically achieve 4-5% WER on clean speech, so AI has reached rough parity in optimal conditions.
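WER is computed with standard edit-distance dynamic programming over words. A self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words,
    via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words → WER of 1/6 ≈ 16.7%.
score = wer("the cat sat on the mat", "the cat sat on mat")
```

Note that WER weights all errors equally — dropping "not" costs the same as dropping "um" — so domains like healthcare and legal often supplement it with term-level accuracy checks.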

Challenges

  • Accent and dialect diversity — ASR models perform best on accents well-represented in training data (typically standard American and British English). Performance degrades for underrepresented accents, dialects, and English as a second language. This creates equity concerns when ASR is used for hiring, assessment, or accessibility.
  • Noisy environments — background noise, music, cross-talk, and reverberation degrade accuracy significantly. While noise-robust models have improved, real-world conditions remain challenging.
  • Domain vocabulary — medical, legal, and technical speech includes terminology absent from general training data. Domain-specific fine-tuning or custom vocabulary lists are required for professional applications.
  • Privacy — voice data is biometric and personally identifiable. Always-on voice assistants raise surveillance concerns. On-device ASR (processing audio locally rather than in the cloud) addresses some privacy concerns but requires powerful edge computing.
  • Low-resource languages — the world's 7,000+ languages are extremely unevenly represented in ASR training data. Most AI speech recognition works well in only 20-30 languages, leaving billions of speakers underserved.