SoulX-Transcriber Unifies Speaker Diarization and Speech Recognition in Single Model

Soul AI Lab has released SoulX-Transcriber, an end-to-end large audio language model that performs speaker diarization and speech recognition jointly in a single framework. Released around June 2, 2026, the model tackles a longstanding challenge in multi-speaker transcription by learning speaker attribution, timestamped segmentation, and transcription together rather than chaining separate cascaded models.

End-to-End Architecture Eliminates Error Compounding

Traditional multi-speaker transcription systems use a two-stage pipeline: first identifying who spoke when through speaker diarization, then transcribing what was said through automatic speech recognition. Errors from the first stage compound in the second, particularly in challenging scenarios with overlapping speech or rapid turn-taking. SoulX-Transcriber's unified approach jointly optimizes both tasks, allowing the model to leverage transcription context to improve speaker identification and vice versa.
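To make the compounding effect concrete, the toy sketch below assumes a perfect recognizer and varies only the diarization boundary. The word timings, the speaker turns, and the helper names (attribute_words, attribution_errors) are invented for illustration and do not come from SoulX-Transcriber or any real pipeline.

```python
# Toy illustration of error compounding in a cascaded pipeline.
# All data and function names here are hypothetical.

def attribute_words(words, turns):
    """Assign each (time, word) to the speaker whose diarized turn contains that time."""
    attributed = []
    for time, word in words:
        speaker = next((s for start, end, s in turns if start <= time < end), None)
        attributed.append((word, speaker))
    return attributed


def attribution_errors(words, turns, truth):
    """Count words whose assigned speaker differs from the reference speaker."""
    assigned = attribute_words(words, turns)
    return sum(speaker != ref for (_, speaker), ref in zip(assigned, truth))


# Recognized words with their timestamps (ASR assumed to be error-free).
words = [(0.2, "how"), (0.6, "are"), (1.0, "you"), (1.4, "fine"), (1.8, "thanks")]
truth = ["A", "A", "A", "B", "B"]

exact_turns = [(0.0, 1.2, "A"), (1.2, 2.0, "B")]  # correct speaker change at 1.2 s
late_turns = [(0.0, 1.6, "A"), (1.6, 2.0, "B")]   # speaker change detected 0.4 s late

errors_exact = attribution_errors(words, exact_turns, truth)
errors_late = attribution_errors(words, late_turns, truth)
print(errors_exact, errors_late)
```

Even with every word recognized correctly, the late boundary hands "fine" to the wrong speaker, and the recognition stage has no way to repair it. That is the kind of first-stage error a jointly optimized model is meant to avoid.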

The model produces structured outputs containing timestamps, speaker labels, and transcripts in a single pass. This design is described as especially effective for conversations with overlapping speech, fast turn-taking, and similar-sounding voices, the scenarios where cascaded approaches typically struggle.
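A downstream consumer might handle such structured output along the lines of the sketch below. The article does not specify the model's serialization, so the line format here ("[start-end] speaker: text") is purely an assumption, as are the Segment class and both helper functions.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # speaker label, e.g. "spk1"
    text: str      # transcript for this segment


# Hypothetical line format assumed for illustration: "[0.00-2.35] spk1: hello"
_LINE = re.compile(r"^\[(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s+([^\s:]+):\s*(.*)$")


def parse_segments(raw: str) -> list[Segment]:
    """Parse one segment per non-empty line; raise on malformed input."""
    segments = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        start, end = float(match.group(1)), float(match.group(2))
        if end < start:
            raise ValueError(f"segment ends before it starts: {line!r}")
        segments.append(Segment(start, end, match.group(3), match.group(4)))
    return segments


def overlapping_pairs(segments: list[Segment]) -> list[tuple[int, int]]:
    """Return index pairs of segments from different speakers that overlap in time."""
    pairs = []
    for i, a in enumerate(segments):
        for j in range(i + 1, len(segments)):
            b = segments[j]
            if a.speaker != b.speaker and a.start < b.end and b.start < a.end:
                pairs.append((i, j))
    return pairs


raw = """[0.00-2.35] spk1: hello there
[2.10-4.00] spk2: hi
[4.00-5.00] spk1: ok"""
segments = parse_segments(raw)
print(overlapping_pairs(segments))
```

Keeping timestamps, speaker labels, and text in one record makes overlapping speech easy to detect after the fact. The first two segments above overlap between 2.10 s and 2.35 s.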

Built on Qwen3-Omni With Two-Stage Training

SoulX-Transcriber builds on top of Qwen3-Omni-30B-A3B-Instruct, implementing a two-stage training strategy. The first stage uses speaker-aware multi-task continuous pre-training to enhance speaker representation learning and boundary perception. The second stage applies supervised fine-tuning to optimize for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions.

The model is reported to achieve superior performance on public benchmarks including AISHELL-4, a Mandarin multi-speaker meeting corpus, and AliMeeting, a Mandarin meeting transcription benchmark. These results are presented as evidence that joint modeling is effective for real-world multi-speaker scenarios.

Open-Source Release With Pre-Trained Weights

Soul AI Lab released SoulX-Transcriber as open-source software with pre-trained model weights available on both GitHub and Hugging Face. The GitHub repository accumulated 164 stars within approximately two days of release. An associated arXiv paper (2606.02400) details the technical architecture and training methodology.

The release includes implementation code and pre-trained weights, enabling researchers and developers to deploy multi-speaker transcription capabilities for applications including meeting transcription with speaker labels, podcast and interview transcription, call center analytics, and multi-speaker subtitle generation.

Practical Applications for Meeting and Media Transcription

The unified approach particularly benefits use cases requiring accurate speaker attribution in complex audio environments. Meeting transcription systems can now maintain speaker consistency across rapid exchanges without separate diarization and recognition stages. Podcast and interview platforms can generate accurate transcripts with proper speaker labels even when participants interrupt or speak simultaneously.

Call center analytics applications benefit from improved accuracy in scenarios with background noise and overlapping speech. Media production workflows can generate multi-speaker subtitles more reliably, reducing manual correction time for content with multiple speakers.
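For the subtitle use case, speaker-attributed segments map naturally onto the SubRip (SRT) format. The sketch below assumes segments are already available as (start, end, speaker, text) tuples. That tuple layout and the function names are illustrative choices, not part of the SoulX-Transcriber release.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, speaker, text) tuples as numbered SRT cues.

    Cues are sorted by start time and each is prefixed with its speaker label.
    """
    blocks = []
    ordered = sorted(segments, key=lambda seg: seg[0])
    for index, (start, end, speaker, text) in enumerate(ordered, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


example = [
    (2.10, 4.00, "spk2", "hi"),
    (0.00, 2.35, "spk1", "hello there"),
]
srt_text = to_srt(example)
print(srt_text)
```

Overlapping cues, such as the two above, are emitted as-is. How a player renders simultaneous cues varies, so a production workflow would likely need its own policy for merging or stacking them.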

Key Takeaways

SoulX-Transcriber unifies speaker diarization and speech recognition in a single end-to-end model, eliminating error compounding from cascaded approaches
The model builds on Qwen3-Omni-30B-A3B-Instruct with a two-stage training strategy enhancing speaker representation and boundary perception
SoulX-Transcriber is reported to achieve superior performance on the AISHELL-4 and AliMeeting benchmarks for multi-speaker transcription
The model outputs structured results containing timestamps, speaker labels, and transcripts in a single pass
Soul AI Lab released the model as open-source with pre-trained weights on GitHub and Hugging Face, and the GitHub repository accumulated 164 stars within approximately two days
