Enterprise AI company Cohere launched Transcribe, its first voice model designed specifically for automatic speech recognition (ASR), on March 26, 2026. The 2-billion-parameter model leads the Hugging Face Open ASR Leaderboard with a 5.42% average word error rate, outperforming both open and closed-source alternatives, including OpenAI's Whisper and ElevenLabs Scribe.
Dedicated ASR Architecture Outperforms Multimodal Alternatives
Transcribe is a dedicated audio-in, text-out ASR model optimized for transcription tasks such as note-taking and speech analysis. Because it is not a general-purpose multimodal model, its focused architecture delivers efficiency advantages. Key figures:
- Up to 3x faster, measured by real-time factor, than other dedicated ASR models in the same size range
- 5.42% average word error rate across supported languages
- Supports 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese (Mandarin), Japanese, and Korean
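For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how they are conventionally defined, not Cohere's or the leaderboard's evaluation code. Word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript; real-time factor is processing time divided by audio duration, so lower is faster. The function names `word_error_rate` and `real_time_factor` are illustrative, and the only text normalization assumed here is lowercasing and whitespace splitting, which is simpler than what benchmark evaluations typically apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # Two seconds of compute for one minute of audio.
    print(real_time_factor(2.0, 60.0))
```

Under this definition, a 5.42% average word error rate corresponds to roughly five or six word errors per 100 reference words, averaged over the benchmark's test sets.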
The model is available under the Apache 2.0 license on Hugging Face and for free through Cohere's API. Cohere plans to integrate Transcribe into North, its enterprise agent orchestration platform.
Known Limitations in Language Detection and Code-Switching
Cohere openly acknowledged several limitations in the initial release:
- No explicit automatic language detection capability
- Inconsistent performance on code-switched audio mixing multiple languages
- No timestamp or speaker diarization features
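One practical consequence of the first limitation is that an application would likely need to decide the audio language itself before requesting a transcript. The sketch below shows one way to validate that choice against the 14 supported languages. It is an illustration only: the `SUPPORTED_LANGUAGES` table and `resolve_language` helper are hypothetical, and the use of ISO 639-1 codes is an assumption, not a documented part of Cohere's API.

```python
# Hypothetical lookup table: the 14 languages listed in the announcement,
# mapped to ISO 639-1 codes. Whether Transcribe expects these exact codes
# is an assumption made for illustration.
SUPPORTED_LANGUAGES = {
    "english": "en",
    "german": "de",
    "french": "fr",
    "italian": "it",
    "spanish": "es",
    "portuguese": "pt",
    "greek": "el",
    "dutch": "nl",
    "polish": "pl",
    "arabic": "ar",
    "vietnamese": "vi",
    "chinese": "zh",
    "mandarin": "zh",  # alias for the same supported language
    "japanese": "ja",
    "korean": "ko",
}


def resolve_language(name_or_code: str) -> str:
    """Return the language code for a supported language, or raise ValueError."""
    key = name_or_code.strip().lower()
    if key in SUPPORTED_LANGUAGES:
        return SUPPORTED_LANGUAGES[key]
    if key in SUPPORTED_LANGUAGES.values():
        return key
    raise ValueError(f"unsupported language: {name_or_code!r}")


if __name__ == "__main__":
    print(resolve_language("German"))
    print(resolve_language("ja"))
```

Failing fast on an unsupported language avoids sending audio that the model was never trained to transcribe. Code-switched recordings remain a harder case, since a single language choice cannot describe them.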
Despite these constraints, Transcribe's focus on transcription accuracy gives Cohere a strategic entry into the speech recognition market. By releasing the model as open source, Cohere gives developers a high-quality alternative to proprietary solutions while laying a foundation for enterprise applications through its API and the planned North integration.
Key Takeaways
- On March 26, 2026, Cohere launched Transcribe, a 2-billion-parameter open-source speech recognition model that leads the Hugging Face Open ASR Leaderboard with a 5.42% average word error rate
- The model outperforms OpenAI's Whisper and ElevenLabs Scribe while running up to 3x faster than other dedicated ASR models in its size range
- Transcribe supports 14 languages and is available under the Apache 2.0 license on Hugging Face and for free through Cohere's API
- The dedicated ASR architecture focuses specifically on transcription rather than general-purpose audio processing, enabling efficiency advantages
- Known limitations include no automatic language detection, inconsistent performance on code-switched audio, and no timestamp or speaker diarization features