Cohere Releases Transcribe: Open-Source Speech Recognition Model With 5.42% Word Error Rate

Enterprise AI company Cohere launched Transcribe, its first voice model, on March 26, 2026. Designed specifically for automatic speech recognition (ASR), the 2-billion-parameter model leads the Hugging Face Open ASR Leaderboard with a 5.42% average word error rate, outperforming both open and closed-source alternatives, including OpenAI's Whisper and ElevenLabs Scribe.

Dedicated ASR Architecture Outperforms Multimodal Alternatives

Transcribe is an audio-in, text-out dedicated ASR model optimized specifically for transcription tasks like note-taking and speech analysis. Unlike general-purpose multimodal models, this focused architecture delivers efficiency advantages:

Real-time factor up to 3x faster than other dedicated ASR models in the same size range
5.42% average word error rate across supported languages
Supports 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese (Mandarin), Japanese, and Korean
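
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how they are conventionally computed. It is not Cohere's or the leaderboard's evaluation code: the function names are hypothetical, text normalization is reduced to lowercasing and whitespace splitting, and real-time factor is taken as processing time divided by audio duration (some benchmarks report the inverse, where higher is faster).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; lower means faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # 30 seconds of compute for 60 seconds of audio.
    print(real_time_factor(30.0, 60.0))
```

Under this definition, a 5.42% word error rate corresponds to roughly 5 or 6 word-level errors per 100 reference words, averaged over the benchmark's test sets.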

The model is available under the Apache 2.0 license on Hugging Face and free of charge through Cohere's API. Cohere plans to integrate Transcribe into North, its enterprise agent orchestration platform.

Known Limitations in Language Detection and Code-Switching

Cohere openly acknowledged several limitations in the initial release:

No explicit automatic language detection capability
Inconsistent performance on code-switched audio mixing multiple languages
No timestamp or speaker diarization features
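
Because the model does not detect the spoken language on its own, calling code would presumably need to state the language up front. The sketch below shows one way an application might validate that choice against the 14 supported languages before submitting audio. The helper name is hypothetical, and the use of ISO 639-1 codes as identifiers is an assumption; the article does not say which identifiers the model or API expects.

```python
# The 14 languages listed in the announcement, keyed by ISO 639-1 code.
# Assumption: these codes are illustrative identifiers, not confirmed API values.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "de": "German",
    "fr": "French",
    "it": "Italian",
    "es": "Spanish",
    "pt": "Portuguese",
    "el": "Greek",
    "nl": "Dutch",
    "pl": "Polish",
    "ar": "Arabic",
    "vi": "Vietnamese",
    "zh": "Chinese (Mandarin)",
    "ja": "Japanese",
    "ko": "Korean",
}


def resolve_language(code: str) -> str:
    """Return a normalized language code, or raise if it is not supported."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"unsupported language {code!r}; expected one of: {supported}"
        )
    return normalized


if __name__ == "__main__":
    print(resolve_language(" EN "))
    print(SUPPORTED_LANGUAGES[resolve_language("ja")])
```

Failing early on an unsupported or missing language keeps the error at the call site rather than surfacing later as a poor transcript.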

Despite these constraints, Transcribe's focus on transcription accuracy gives Cohere a strategic entry into the speech recognition market. By releasing the model as open source, Cohere gives developers a high-quality alternative to proprietary solutions while laying a foundation for enterprise applications through its API and planned North integration.

Key Takeaways

On March 26, 2026, Cohere launched Transcribe, a 2-billion-parameter open-source speech recognition model that leads the Hugging Face Open ASR Leaderboard with a 5.42% average word error rate
The model outperforms OpenAI's Whisper and ElevenLabs Scribe while running up to 3x faster than other dedicated ASR models in its size range
Transcribe supports 14 languages and is available under the Apache 2.0 license on Hugging Face and free of charge through Cohere's API
The dedicated ASR architecture focuses specifically on transcription rather than general-purpose audio processing, enabling efficiency advantages
Known limitations include lack of automatic language detection, inconsistent code-switching performance, and no timestamp or speaker diarization features
