SoulX-Transcriber: End-to-End Multi-Speaker Transcription Achieves 2.89% DER on AISHELL-4

SoulX-Transcriber is an end-to-end framework for multi-speaker speech transcription that handles speaker identification, timing, and text recognition in a single model. Built on the Qwen3-Omni-30B architecture, the system achieved a 2.89% Diarization Error Rate (DER) and a 14.16% Word Error Rate (WER) on the AISHELL-4 benchmark, results reported as substantially better than those of comparable systems, including Gemini and Qwen models.

Unified Framework Eliminates Cascaded Pipeline Errors

The core innovation of SoulX-Transcriber is jointly modeling who spoke, when, and what was said. It produces speaker labels, timestamps, and transcribed text in a single pass rather than through separate sequential modules. This end-to-end approach eliminates the errors that traditional cascaded pipelines introduce when speaker diarization, timestamp extraction, and speech recognition are handled independently and each stage inherits the mistakes of the one before it.

The system produces three coordinated outputs: a speaker label identifying each participant, timestamps marking where each speaker's segment begins and ends, and the transcribed text of that segment. Because all three come from the same model, the system can better handle overlapping speech and rapid speaker transitions.
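As an illustration of what these three coordinated outputs might look like to downstream code, here is a minimal sketch. The `Segment` record, its field names, the speaker labels, and the `overlaps` helper are all hypothetical and chosen for this example; the source does not specify the model's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcribed turn. Hypothetical structure, not the model's real schema."""
    speaker: str   # speaker label, e.g. "S1" (assumed naming)
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # transcribed content of the segment


def overlaps(segments):
    """Return (speaker_a, speaker_b, overlap_start, overlap_end) for every
    pair of segments from different speakers that overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    found = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                # Sorted by start time, so no later segment can overlap `a`.
                break
            if a.speaker != b.speaker:
                found.append((a.speaker, b.speaker, b.start, min(a.end, b.end)))
    return found


# Made-up example data for illustration only.
segments = [
    Segment("S1", 0.0, 3.2, "hello everyone"),
    Segment("S2", 2.8, 5.0, "hi, shall we start"),
    Segment("S1", 5.0, 7.5, "yes, first item"),
]

print(overlaps(segments))
```

Keeping speaker, timing, and text in one record makes overlap regions easy to query, which is the kind of case a cascaded pipeline tends to handle poorly because each stage sees only part of this information.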

Speaker-Aware Multi-Stage Training Strengthens Performance

SoulX-Transcriber uses speaker-aware multi-stage training that combines continued pre-training with supervised fine-tuning to strengthen speaker representation. This training approach improves the model's robustness to overlapping speech and rapid speaker transitions, both common challenges in real-world multi-speaker scenarios.

A distinctive feature is speaker-characteristics-driven audio matching for dialogue synthesis, which selects contextually appropriate reference audio based on speaker characteristics. This lets the system maintain speaker consistency even in complex conversational contexts.

Benchmark Results Demonstrate State-of-the-Art Performance

SoulX-Transcriber achieved the following results:

2.89% Diarization Error Rate (DER) on AISHELL-4
14.16% Word Error Rate (WER) on AISHELL-4
5.39% DER on the AliMeeting dataset

These results are reported to substantially outperform comparable systems, including Gemini and Qwen models. The system was also evaluated across multiple domains, including social conversations, drama, and podcast content, demonstrating robustness across diverse audio scenarios.
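For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally defined: WER is the token-level edit distance (substitutions, deletions, and insertions) divided by the number of reference tokens, and DER is the sum of missed speech, false alarm, and speaker confusion time divided by total reference speech time. This is a simplified illustration, not the evaluation code used for the reported numbers. The function names and example values are invented for this sketch, and the token granularity (words versus characters) is an assumption left to the caller.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(ref_tokens, hyp_tokens):
    """Word error rate: edit distance divided by reference length."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def der(missed, false_alarm, confusion, total_speech):
    """Diarization error rate from error durations, all in seconds."""
    return (missed + false_alarm + confusion) / total_speech


# Made-up example: one substitution and one insertion against 5 reference words.
ref = "the meeting starts at nine".split()
hyp = "the meeting start at nine today".split()
print(f"WER = {wer(ref, hyp):.2%}")

# Made-up example: 3 seconds of diarization errors in 100 seconds of speech.
print(f"DER = {der(1.0, 0.5, 1.5, 100.0):.2%}")
```

Real evaluations typically add details this sketch omits, such as text normalization before scoring and a rule for mapping predicted speaker labels onto reference speakers, so figures from this snippet are not directly comparable to the published results.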

Open Source Release With Pre-Trained Weights

The project was created on June 2, 2026, and is accompanied by an arXiv paper (2606.02400), with pre-trained weights available on Hugging Face and ModelScope. The GitHub repository gained 191 stars within the first few days of release, indicating strong community interest in the end-to-end multi-speaker transcription approach.

Key Takeaways

SoulX-Transcriber jointly models speaker identity, timing, and transcription in a single end-to-end framework, eliminating cascaded pipeline errors
The system achieved 2.89% DER and 14.16% WER on AISHELL-4, outperforming comparable systems including Gemini and Qwen models
The model is built on the Qwen3-Omni-30B architecture and uses speaker-aware multi-stage training for improved robustness to overlapping speech
Speaker-characteristics-driven audio matching enables contextually appropriate dialogue synthesis with speaker consistency
Pre-trained weights are available on Hugging Face and ModelScope, alongside arXiv paper 2606.02400; the project was created June 2, 2026
