Whisper Speaker Diarization — How to Add Speaker Labels to Whisper (2026)
OpenAI's Whisper transcribes speech but doesn't label who is speaking. To get speaker labels you bolt on a diarization model (pyannote.audio or NVIDIA NeMo) and align it to Whisper's word timestamps. Below: working code for the three main open-source paths (WhisperX, whisper-diarization, raw pyannote), an honest comparison, real failure modes, production deployment realities, and self-host vs managed cost math.
Does Whisper Do Speaker Diarization?
No. The open-source Whisper model outputs text + segment timestamps, not speaker IDs. The OpenAI whisper-1 hosted API doesn't either.
Whisper is a sequence-to-sequence ASR + translation model. Diarization (speaker embedding + clustering over time) is a different task that needs a separate model family. To get speaker-labeled transcripts you combine Whisper with a diarization model in a multi-stage pipeline.
2026 update: OpenAI released gpt-4o-transcribe-diarize, a separate model on the same audio endpoint that returns speaker-labeled output via a diarized_json format. More on that below.
Reference: openai/whisper Discussion #264 is where Whisper users get pointed for third-party diarization solutions.
The Standard Whisper + Diarization Pipeline
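Every working open-source solution follows the same five-stage pipeline: voice activity detection (VAD), ASR transcription, forced alignment, diarization, and word-to-speaker assignment. Understanding this lets you debug when speaker labels come out wrong.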
Why all five stages matter: skipping forced alignment is the #1 reason laptop tutorials produce garbage speaker labels. Whisper's native segment timestamps are too coarse to attribute words to speakers reliably — you need word-level timing first.
Path 1 — WhisperX (Default Choice)
github.com/m-bain/whisperX · ~22.6k stars · BSD-2-Clause
Architecture: faster-whisper (CTranslate2 backend) + wav2vec2 forced alignment + pyannote diarization + VAD pre-pass. Batched inference, under 8GB VRAM for large-v2, ~70× realtime on a single GPU per their README.
Python (the canonical block every tutorial copies):
import whisperx

HF_TOKEN = "YOUR_HF_TOKEN"  # placeholder: your Hugging Face access token
device = "cuda"
audio_file = "audio.mp3"
batch_size = 16

# 1-2. VAD + ASR via faster-whisper
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 3. Forced alignment for word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata,
                        audio, device, return_char_alignments=False)

# 4-5. Diarization + word→speaker assignment
# (depending on your WhisperX version, this class may instead be exposed
# as whisperx.diarize.DiarizationPipeline)
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
# result["segments"] now contains { start, end, text, speaker }
CLI:
whisperx audio.wav \
  --model large-v2 \
  --diarize \
  --highlight_words True \
  --hf_token <YOUR_HF_TOKEN>
Practical notes
- Requires a Hugging Face token AND accepting terms for pyannote/speaker-diarization-3.1 (or community-1)
- batch_size default 16 fits under 8GB VRAM with large-v2; drop to 4-8 if you hit OOM
- compute_type="float16" halves memory vs float32; use int8 for older GPUs
- Pass min_speakers / max_speakers to diarize_model(audio, min_speakers=2, max_speakers=4) when you know N; significantly better than auto-detect
- CUDA 12.8 recommended; version pinning headaches between PyTorch / CTranslate2 / pyannote are well-documented
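The batch-size advice can be automated. The following is a minimal sketch, assuming the backend reports an out-of-memory condition as a RuntimeError whose message contains "out of memory" (an assumption; check what your PyTorch / CTranslate2 versions actually raise). `transcribe_with_fallback` is a hypothetical helper, not part of WhisperX.

```python
def transcribe_with_fallback(transcribe, audio, batch_sizes=(16, 8, 4)):
    """Try progressively smaller batch sizes until one fits in memory.

    `transcribe` is any callable accepting (audio, batch_size=...), for
    example the `transcribe` method of a loaded WhisperX model.
    Returns (result, batch_size_that_worked).
    """
    last_error = None
    for batch_size in batch_sizes:
        try:
            return transcribe(audio, batch_size=batch_size), batch_size
        except RuntimeError as err:
            # Assumption: OOM surfaces as RuntimeError mentioning "out of memory".
            if "out of memory" not in str(err).lower():
                raise  # unrelated failure: do not mask it
            last_error = err
    raise RuntimeError("all batch sizes ran out of memory") from last_error
```

With the WhisperX block above, this would be called as `transcribe_with_fallback(model.transcribe, audio)`.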
Best for: the default choice. Fast batch processing, multilingual support, most actively maintained, biggest community.
Path 2 — whisper-diarization (No HuggingFace Token)
github.com/MahmoudAshraf97/whisper-diarization · ~5.6k stars · BSD-2
Architecture: Whisper + NVIDIA NeMo (MarbleNet VAD + TitaNet speaker embeddings) + Demucs vocal separation + ctc-forced-aligner + punctuation realignment. Uses NeMo instead of pyannote — no HF token, but heavier (~10GB VRAM recommended).
CLI:
# basic
python diarize.py -a audio.wav

# with options
python diarize.py -a audio.wav \
  --whisper-model large-v3 \
  --device cuda \
  --suppress_numerals
Practical notes
- No Hugging Face token required: uses NeMo models from NVIDIA NGC instead of pyannote
- Demucs vocal separation pre-pass strips music/background → cleaner diarization on podcasts and recorded interviews
- Output: per-speaker SRT + TXT in the same directory as the input
- Heavier than WhisperX: Demucs + NeMo + Whisper all loaded at once
- Slower than WhisperX (no batched inference)
Best for: environments where you can't accept Hugging Face model gating (compliance, air-gapped, enterprise procurement); audio with significant background music or noise where the Demucs pre-pass helps.
Path 3 — Roll Your Own (Whisper + Raw pyannote)
If you want full control or are learning the pipeline, build from primitives. This is also the approach if you need to integrate with an existing custom ASR/diarization stack.
import whisper
from pyannote.audio import Pipeline

# 1-2. Transcribe with Whisper (word_timestamps=True is critical)
model = whisper.load_model("large-v3")
asr_result = model.transcribe("audio.wav", word_timestamps=True)

# 3-4. Diarize with pyannote
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN")
diarization = pipeline("audio.wav")

# 5. Align: map each word's midpoint timestamp to its speaker turn
for segment in asr_result["segments"]:
    for word in segment.get("words", []):
        midpoint = (word["start"] + word["end"]) / 2
        word["speaker"] = "UNKNOWN"
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            if turn.start <= midpoint <= turn.end:
                word["speaker"] = speaker
                break
# Now each word has a speaker label
A good reference for a clean class-based version is the Scalastic three-class architecture (separate WhisperAudioTranscriber, PyannoteDiarizer, SpeakerAligner classes) — useful if you're building this into an existing codebase.
Best for: learning, custom pipelines, integrating diarization into an existing ASR system, or when you want to swap individual components (e.g., a custom alignment model).
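If you are swapping components, the word-to-speaker step is the easiest one to replace. As one possible variation on the midpoint rule, the sketch below assigns each word to the speaker turn it overlaps the most. It works on plain tuples and dicts so it can be tried without a GPU; `assign_speakers` and the sample data are illustrative, not part of Whisper or pyannote.

```python
def assign_speakers(words, turns, unknown="UNKNOWN"):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of dicts with "start" and "end" in seconds.
    turns: list of (start, end, speaker) tuples.
    """
    for word in words:
        best_speaker, best_overlap = unknown, 0.0
        for start, end, speaker in turns:
            # Length of the intersection; negative when they do not touch.
            overlap = min(word["end"], end) - max(word["start"], start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        word["speaker"] = best_speaker
    return words


turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
words = [
    {"word": "hello", "start": 0.5, "end": 0.9},
    {"word": "there", "start": 1.8, "end": 2.6},  # straddles the turn boundary
    {"word": "bye", "start": 6.0, "end": 6.3},    # outside every turn
]
assign_speakers(words, turns)
```

With pyannote, the `turns` list can be built once with `[(t.start, t.end, spk) for t, _, spk in diarization.itertracks(yield_label=True)]`, which also avoids re-iterating the diarization for every word.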
Honest Comparison — WhisperX vs whisper-diarization vs Raw pyannote
The SERP for "whisper diarization" has both repos in the top 5 but nobody compares them side-by-side. Here's the breakdown.
| Feature | WhisperX | whisper-diarization | Raw pyannote |
|---|---|---|---|
| Stars | ~22.6k | ~5.6k | N/A (library) |
| ASR backend | faster-whisper | Whisper | Your choice |
| Diarization | pyannote | NeMo TitaNet | pyannote |
| Vocal separation | No | Demucs | DIY |
| HF token required | Yes | No | Yes (for pyannote) |
| VRAM (typical) | <8 GB | ~10 GB | Varies |
| Speed (per their docs) | ~70× realtime (batched) | Slower than WhisperX | Slowest (no batching) |
| Setup difficulty | Easy | Easy | Medium |
| Best for | Default choice | No HF gating / noisy audio | Learning / custom pipelines |
OpenAI's gpt-4o-transcribe-diarize (2026)
The biggest recent change in the diarization landscape: OpenAI shipped a new model in 2026, gpt-4o-transcribe-diarize, that bundles ASR + diarization natively. Many existing tutorials (and older blog posts ranking on this query) don't mention it.
- Native diarization via /v1/audio/transcriptions with the diarized_json response format
- Supports up to 4 known_speaker_references[] (2-10s audio clips) to map speaker IDs to known names
- Reported around $0.006/min at launch (verify on platform.openai.com)
- NOT available on the Realtime API yet; /v1/audio/transcriptions only
When to use it: simplest integration if you're already on OpenAI APIs, you don't need custom vocabulary or on-prem deployment, and basic speaker labeling is enough. When it's not enough: if you need consistent speaker identity across files (the "same person in episode 1 and episode 5" problem), custom-trained ASR, or features beyond diarization (summaries, meeting bot, multi-format export).
Pipeline Failure Modes — What Actually Breaks
The READMEs admit these in one-line caveats; in practice they show up constantly. Knowing them up front saves debugging time.
Overlapping speech
Whisper transcribes only the dominant voice. pyannote can flag overlap regions, but the word-to-speaker assignment step can only pick one speaker per word. Both WhisperX and whisper-diarization explicitly admit this in their READMEs. For panel discussions or phone calls with crosstalk, expect noticeably lower accuracy.
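Since the assignment step cannot represent two speakers per word, a pragmatic option is to flag words that fall in overlap regions so a reviewer or downstream step can treat them as low confidence. A minimal sketch, assuming diarization turns have been collected as (start, end, speaker) tuples; both function names and the `in_overlap` key are hypothetical.

```python
def overlap_regions(turns):
    """Return (start, end) intervals where turns of different speakers intersect."""
    regions = []
    ordered = sorted(turns)  # sort by start time
    for i, (s1, e1, spk1) in enumerate(ordered):
        for s2, e2, spk2 in ordered[i + 1:]:
            if s2 >= e1:
                break  # sorted by start, so no later turn can overlap this one
            if spk1 != spk2:
                regions.append((s2, min(e1, e2)))
    return regions


def flag_overlapped_words(words, regions):
    """Mark each word dict with in_overlap=True if it touches any overlap region."""
    for word in words:
        word["in_overlap"] = any(
            word["start"] < end and word["end"] > start for start, end in regions
        )
    return words
```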
Short backchannels (<1 second)
Speaker embeddings are unreliable on under one second of audio. "Yeah", "mhm", "right" tend to get merged into the surrounding speaker. Usually fine — except for therapy, coaching, or research transcripts where listener affirmations matter analytically.
Speaker switches mid-segment
Whisper sometimes emits a long segment that spans two speakers. After forced alignment, the word-level timestamps let you split — but the split usually lands at the nearest clause boundary, not the actual speaker boundary. Result: one or two words attributed to the wrong speaker at every turn.
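One way to avoid clause-boundary splits is to ignore the original segment boundaries and regroup by word-level speaker labels. The sketch below assumes each word dict already carries a "speaker" key (for example after a word-to-speaker assignment step); `split_by_speaker` is an illustrative helper, not a library function.

```python
def split_by_speaker(words):
    """Group consecutive words that share a speaker into utterances."""
    utterances = []
    for word in words:
        speaker = word.get("speaker", "UNKNOWN")
        text = word["word"].strip()
        if utterances and utterances[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current utterance.
            utterances[-1]["end"] = word["end"]
            utterances[-1]["text"] += " " + text
        else:
            utterances.append({"speaker": speaker, "start": word["start"],
                               "end": word["end"], "text": text})
    return utterances
```

The output is only as good as the word labels, so boundary words can still land on the wrong side of a turn.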
Words without alignable tokens
Numbers like "2014." or currency like "£13.60" may have no entry in the wav2vec2 alignment dictionary and get dropped from word timestamps entirely. Common in finance, scientific, or technical content. Workarounds: pre-process numbers into spelled-out form, or use a larger alignment model.
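Another workaround is to give untimed words an estimated timestamp so they can still receive a speaker label. This sketch assumes unalignable words are still present in the word list but lack "start"/"end" keys (verify this against your aligner's output); the `estimated` flag is a hypothetical marker added here.

```python
def fill_missing_times(words):
    """Give untimed words a zero-length timestamp borrowed from a neighbour."""
    last_end = None
    for i, word in enumerate(words):
        if "start" in word and "end" in word:
            last_end = word["end"]
            continue
        if last_end is None:
            # Nothing timed before this word: borrow the next timed word's start.
            last_end = next(
                (w["start"] for w in words[i + 1:] if "start" in w), 0.0)
        word["start"] = word["end"] = last_end
        word["estimated"] = True  # lets downstream code treat it with caution
    return words
```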
Similar voices or distant mics
Embedding clustering merges similar speakers or splits one speaker into multiple clusters. Two co-hosts with similar voice pitch on the same mic, a meeting recorded from a single laptop in the middle of a table — both produce predictable errors. Fix: provide min_speakers / max_speakers when you know N, or use higher-quality close-mic audio per speaker.
Speaker re-identification across files
pyannote and NeMo both treat each file independently. SPEAKER_00 in file 1 is not necessarily SPEAKER_00 in file 2 — they're just cluster IDs. For podcast series, interview archives, or recurring meetings, you need additional speaker recognition (embedding extraction + cosine similarity against a reference set). No popular open-source pipeline does this end-to-end out of the box.
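The matching step itself is small once you have embeddings. Below is a minimal sketch of cosine-similarity matching, assuming you have already extracted one embedding vector per diarized cluster and per known speaker. The 0.7 threshold is an arbitrary placeholder that needs tuning on your own data, and the toy 3-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speakers(file_embeddings, references, threshold=0.7):
    """Map per-file cluster IDs to known names; weak matches keep their ID."""
    mapping = {}
    for cluster_id, embedding in file_embeddings.items():
        scores = {name: cosine_similarity(embedding, ref)
                  for name, ref in references.items()}
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] >= threshold:
            mapping[cluster_id] = best
        else:
            mapping[cluster_id] = cluster_id  # no confident match
    return mapping


references = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
clusters = {"SPEAKER_00": [0.9, 0.1, 0.0], "SPEAKER_01": [0.1, 0.1, 0.9]}
mapping = match_speakers(clusters, references)
```

Note that this greedy version can map two clusters to the same name; a production version would need to resolve such conflicts.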
The GPT-4o Post-Processing Hack
From the OpenAI community thread on Whisper diarization — undocumented in any major tutorial: after running WhisperX or similar, send the rough transcript to GPT-4o and ask it to relabel speakers based on conversational cues. Works surprisingly well for content with clear conversational structure (interviews, podcasts where the host asks questions, customer-support calls).
prompt = f"""Below is a transcript with speakers labeled
SPEAKER_00, SPEAKER_01, etc. Based on conversational context
(questions, responses, name mentions, role indicators), relabel
the speakers with their likely roles or names if mentioned.
If you cannot confidently identify a speaker, keep the generic
SPEAKER_XX label rather than guessing.
Return the transcript with corrected speaker labels.
Transcript:
{transcript}
"""
# Then send to GPT-4o via the chat completions API
When it works: structured conversations (host + guest, interviewer + interviewee, agent + customer) where one speaker introduces themselves or asks distinctive questions.
When it doesn't: technical panel discussions, group meetings with multiple peers, audio without role-distinctive language. GPT-4o will sometimes hallucinate confident-sounding names — explicitly instruct it not to guess.
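The prompt above expects a `transcript` string. A minimal sketch of building one from diarized segments, assuming each segment is a dict with "speaker" and "text" keys as in the WhisperX output described earlier; `format_transcript` is an illustrative helper.

```python
def format_transcript(segments):
    """Render diarized segments as 'SPEAKER_XX: text' lines for an LLM prompt."""
    lines = []
    previous = None
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if speaker == previous:
            lines[-1] += " " + text  # merge consecutive segments of one speaker
        else:
            lines.append(f"{speaker}: {text}")
        previous = speaker
    return "\n".join(lines)
```

Merging consecutive segments keeps the prompt shorter and makes turn-taking easier for the model to follow.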
Self-Host Cost Math — Realistic Numbers
Most tutorials hand-wave on cost. Here are concrete ballparks (verify current spot pricing before budgeting):
| GPU | Spot price | WhisperX throughput | Effective cost / audio-hour |
|---|---|---|---|
| T4 | ~$0.30-0.50/hr | ~20-30× realtime | ~$0.01-0.025 |
| A10G | ~$0.50-1.00/hr | ~50-70× realtime | ~$0.01-0.02 |
| RTX 4090 (Vast.ai / RunPod) | ~$0.40-0.70/hr | ~60-80× realtime | ~$0.007-0.015 |
So at scale, self-hosting can land around $0.01-0.03 per audio-hour in raw GPU costs. That sounds cheap — and it is, if you only count GPU time.
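The arithmetic behind the last column is GPU price divided by the realtime factor. A short sketch that reproduces the T4 row; the table's figures are rounded ballparks, so other rows will not necessarily match to the digit.

```python
def cost_per_audio_hour(gpu_price_per_hour, realtime_factor):
    """GPU dollars needed to process one hour of audio."""
    return gpu_price_per_hour / realtime_factor


def cost_range(price_low, price_high, speed_low, speed_high):
    """Best case pairs the low price with the high speed; worst case the reverse."""
    return (cost_per_audio_hour(price_low, speed_high),
            cost_per_audio_hour(price_high, speed_low))


# T4 row: $0.30-0.50/hr at 20-30x realtime
t4_low, t4_high = cost_range(0.30, 0.50, 20, 30)
```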
What the math hides:
- Initial integration: 1-2 weeks of engineering for a working production pipeline (queueing, retries, monitoring, GPU pool management)
- Ongoing maintenance: dependency pinning between PyTorch / CUDA / CTranslate2 / pyannote / wav2vec2 is genuinely fragile — pyannote ships a new pipeline roughly yearly
- Hugging Face token management for production (token rotation, gated model acceptance for every new pipeline version)
- GPU pool sizing for variable load — cold starts of 30-90s are real
- Engineering time to handle the failure modes above (overlap detection, speaker count hints, retry logic)
Honest break-even ranges vs managed APIs at $0.10-0.30/audio-hour:
- Under ~500 audio-hr/month: managed wins on dev time + reliability
- 500-5,000 audio-hr/month: depends on whether you have ML engineering capacity
- 5,000+ audio-hr/month: self-host wins economically if you have the engineering team
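To see where your own break-even sits, treat engineering and maintenance as a fixed monthly overhead and compare per-hour rates. This is a rough sketch; the $900/month overhead in the example is a hypothetical figure chosen for illustration, not a measured cost.

```python
def break_even_hours(fixed_overhead, managed_price, gpu_cost_per_audio_hour):
    """Monthly audio-hours above which self-hosting becomes cheaper.

    fixed_overhead: monthly engineering/maintenance cost in dollars.
    managed_price, gpu_cost_per_audio_hour: dollars per audio-hour.
    """
    saving_per_hour = managed_price - gpu_cost_per_audio_hour
    if saving_per_hour <= 0:
        return float("inf")  # self-hosting never pays off
    return fixed_overhead / saving_per_hour


# Hypothetical inputs: $900/month overhead, $0.20 managed, $0.02 self-hosted GPU
hours = break_even_hours(900, 0.20, 0.02)  # about 5,000 audio-hours/month
```

The result is very sensitive to the overhead estimate, which is why the ranges above are broad.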
Production Deployment Realities
What every laptop tutorial skips. If you're going past hobby usage:
- GPU cold starts: loading large-v3 + alignment + pyannote takes 30-90 seconds per cold start. Use warm pools or persistent workers. Don't spin up a fresh container per request.
- Queueing: for multi-tenant or batch workloads, queue audio files. Don't try to run multiple diarizations in parallel on one GPU beyond batch_size limits; you'll OOM.
- Memory management: keep models loaded across requests; loading per-request adds the cold-start hit every time. Watch out for memory growth in long-running workers.
- Retries: pyannote pipelines occasionally hang. Implement timeouts + exponential backoff. The Hugging Face Forum has an active thread on this.
- Dependency pinning: pin PyTorch, CUDA, cuDNN, CTranslate2, pyannote, faster-whisper versions in a Docker image. Update one at a time. Skip a major pyannote release and you'll spend a day re-pinning.
- Speaker count hints: if you know N speakers, pass it. Auto-detect works but constrained always wins.
- Multi-region / latency: if you serve users globally, plan GPU pools per region. A diarization job that takes 5 minutes round-trip from Sydney to a US-East GPU pool feels broken even when it's technically working.
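The retry advice can be sketched in a few lines. This is a minimal example, assuming the job raises an exception on failure; a pipeline that truly hangs never raises, so in practice you would run it in a separate process you can kill and convert the timeout into an exception. `run_with_retries` is a hypothetical helper.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call job() and retry failures with exponential backoff (1s, 2s, 4s, ...).

    `sleep` is injectable so the backoff can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return job()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Usage would look like `run_with_retries(lambda: diarize_model(audio))`, with the worker's own timeout wrapped inside the callable.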
Self-Host vs Use a Managed Service — Honest Framework
Self-host when…
- You have ML engineering capacity (1+ engineer fluent in PyTorch)
- >1,000 audio-hours/month sustained
- Data residency requirements (on-prem, EU-only, air-gapped)
- You need custom models or fine-tuning
- You're comfortable maintaining Python + CUDA dependencies
Managed wins when…
- <500 audio-hours/month
- No dedicated ML engineer on the team
- You need diarization PLUS other features (summaries, meeting bot, multi-format export)
- Time-to-market matters more than per-hour cost
- You'd rather not manage HF tokens, CUDA, and GPU pools
Managed Services for Whisper + Diarization
Pricing changes — verify each before budgeting. As of mid-2026:
- OpenAI gpt-4o-transcribe-diarize: native diarization on the OpenAI audio API. ~$0.006/min. Simplest if you're already on OpenAI; basic feature set.
- AssemblyAI Universal-2: strong all-around. Diarization is a small add-on fee. Good docs.
- Deepgram Nova-3: fast, has real-time streaming with diarization. Base ~$0.46/audio-hour PAYG plus diarization add-on.
- Rev AI Turbo: cheapest mainstream API (~$0.10/audio-hour). Good for batch.
- Speechmatics: strong accuracy reputation, on-prem option available, enterprise pricing.
- VexaScribe: Whisper-based service with diarization, AI summaries, and meeting bot bundled. Best fit if you want speaker-labeled transcripts as part of a full transcription product rather than just a diarization API. From $0.30/audio-hour on the $2/month Starter plan.
Where VexaScribe Fits
If you're reading this because you want speaker-labeled transcripts for your own audio (not because you're building a transcription product yourself), VexaScribe is one option that handles the whole pipeline for you.
We're a Whisper-based transcription service that runs the same pipeline this page describes — VAD, Whisper Large-v3 transcription, forced alignment, pyannote-based diarization (up to 50 speakers) — and adds: AI-generated meeting summaries (action items, decisions, blockers, open questions), a Zoom / Google Meet / Microsoft Teams meeting bot for live calls, word-level SRT and VTT export with proper cue splitting (so subtitles import cleanly into Premiere / Final Cut / DaVinci), and 99-language support.
When VexaScribe is a fit
- You don't want to manage Python + CUDA + pyannote + HF tokens yourself
- You need diarization PLUS summaries, meeting bot, or multi-format export, not just a raw diarization API
- Under ~2,000 audio-hours/month (above this, self-host math becomes attractive)
- You want a product (dashboard + editor + exports), not an API endpoint
When VexaScribe is NOT a fit
- You're building a developer-facing transcription product: look at Deepgram, AssemblyAI, or OpenAI
- You need on-prem deployment for compliance
- You're processing >5,000 audio-hours/month and have ML engineers: self-host wins
- You only need raw Whisper transcription with no speaker labels: OpenAI's whisper-1 API is cheaper
30 minutes free per month if you want to try the output quality on your own files — no credit card required.
Whisper Diarization FAQ
Does the OpenAI Whisper API include speaker diarization?
The original whisper-1 hosted API does NOT include speaker diarization — it returns text and segment timestamps only. In 2026 OpenAI released gpt-4o-transcribe-diarize, a separate model exposed through the same /v1/audio/transcriptions endpoint, which DOES return speaker-labeled output via a diarized_json response format. It supports up to 4 known_speaker_references[] (2-10s audio clips) to map anonymous speaker IDs to known names. Verify current pricing on platform.openai.com — it was reported around $0.006/min at launch.
WhisperX vs whisper-diarization — which should I use?
WhisperX (github.com/m-bain/whisperX, ~22.6k stars) is the default choice: faster-whisper backend, pyannote diarization, batched inference at ~70× realtime on a single GPU per their README, under 8GB VRAM for large-v2. Requires a Hugging Face token and accepting pyannote model terms. whisper-diarization (github.com/MahmoudAshraf97/whisper-diarization, ~5.6k stars) uses NVIDIA NeMo models (MarbleNet VAD + TitaNet embeddings) plus Demucs vocal separation — no HuggingFace token needed, heavier (~10GB VRAM), but better on audio with background music. Pick WhisperX if you want speed and the most active project. Pick whisper-diarization if you can't use HF-gated models or your audio has music/noise.
Do I need a Hugging Face token for Whisper diarization?
Yes, if you use pyannote.audio (which WhisperX depends on). You need to (1) create a free HF account, (2) accept the gated terms for pyannote/speaker-diarization-3.1 (or community-1), and (3) generate a token at hf.co/settings/tokens. This is the single biggest packaging complaint in the WhisperX issue tracker. To avoid it entirely, use MahmoudAshraf97/whisper-diarization (NeMo-based, no HF token), or use the raw NVIDIA NeMo pipeline directly.
Can I run Whisper diarization locally without a GPU?
Technically yes — set device="cpu" — but it is hours-per-hour-of-audio rather than minutes. Not practical for production. A consumer GPU (RTX 3060/4070, 8GB+) is the realistic minimum for batch processing. For lightweight CPU work, whisper.cpp's tinydiarize (-tdrz flag) provides turn-segmentation only (detects "speaker changed" but does not cluster the same person across the file), which is much cheaper but a different capability.
How accurate is pyannote speaker diarization?
Per pyannote's own benchmark table (pyannote/pyannote-audio README), community-1 reports diarization error rate (DER) of about 17.0% on AMI (meeting audio), 20.2% on DIHARD 3 (hard diverse audio), and 11.2% on VoxConverse (videos). Their commercial precision-2 model is 25-40% lower DER on the same benchmarks. Important caveat: DER numbers vary substantially with dataset version, collar settings, and overlap handling — different sources publish different numbers for what looks like the same test. Always cite the specific benchmark.
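For reference, DER is the sum of false-alarm, missed-speech, and speaker-confusion time divided by total reference speech time. A small sketch with made-up durations (not taken from any benchmark) shows how the pieces combine:

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER = (false alarm + missed detection + speaker confusion) / reference speech.

    All arguments are durations in seconds.
    """
    return (false_alarm + missed + confusion) / total_speech


# Illustrative numbers only: 600 s of reference speech with 102 s of errors
der = diarization_error_rate(false_alarm=20, missed=40, confusion=42,
                             total_speech=600)  # about 0.17, i.e. 17%
```

Collar settings and overlap handling change how much time counts toward each term, which is why the same model can report different DER on what looks like the same test.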
Why are speaker labels generic (SPEAKER_00) instead of real names?
Diarization solves "who spoke when" without knowing who anyone is. It clusters voice embeddings and assigns anonymous IDs (SPEAKER_00, SPEAKER_01). To get real names you need an additional step: either (a) provide reference audio for each known speaker and use speaker recognition to map clusters to identities, (b) post-process with an LLM that infers names from conversational cues (the GPT-4o hack covered above), or (c) use OpenAI's gpt-4o-transcribe-diarize with known_speaker_references[]. None of the popular open-source pipelines do speaker recognition end-to-end by default.
Can Whisper detect overlapping speakers?
Whisper itself transcribes only the dominant voice when speakers overlap — it does not produce multi-channel output. pyannote.audio can flag overlap regions, but the word-to-speaker assignment step that follows can only pick one speaker per word. Both the WhisperX and whisper-diarization READMEs explicitly state that overlapping speech is not handled well. For audio with significant overlap (panel discussions, phone calls with crosstalk), expect noticeably lower accuracy than for clean turn-taking conversation.
How do I make Speaker A in file 1 = Speaker A in file 2?
Out of the box with WhisperX or whisper-diarization, you can't — each file is diarized independently and speaker labels are anonymous per file. To get consistent speaker identity across files (e.g., for a podcast series with recurring guests), you need to: (1) extract a speaker embedding for each known person from reference audio, (2) compare your file's diarized speaker embeddings to the reference set using cosine similarity, (3) relabel matching speakers. This is non-trivial to build reliably. Managed services like AssemblyAI's speaker recognition and VexaScribe's account-level speaker library handle this if you don't want to build it yourself.