By VexaScribe Editorial · Published April 2, 2026 · Verified
Best Speaker Diarization Tools in 2026 (Apps, APIs & Open-Source)
As of June 2026, we compare 20 speaker diarization tools across three categories: consumer apps for non-technical users, developer APIs for building speech applications, and open-source libraries for self-hosted solutions. The comparison includes the 2024-2026 model generation (AssemblyAI Universal-2, Deepgram Nova-3, Speechmatics Ursa, NVIDIA NeMo Sortformer). Fireflies.ai has the highest benchmark accuracy among AI-based tools at 92.8% across 500+ hours of testing. VexaScribe is the cheapest consumer tool with auto-diarization on every plan ($2/month). Riverside reaches 100% speaker-attribution accuracy by recording each speaker on a separate track rather than by AI diarization. For developers, AssemblyAI Universal-2 ($0.006/min, or $0.36/hr) leads on accuracy and Deepgram Nova-3 ($0.0043/min, or $0.26/hr) leads on price. For open-source self-hosting, pyannote.audio 3.1+ remains the gold standard (21.7% DER on DIHARD III, 18.8% on AMI).
Diarization Error Rate (DER) = (false alarm speech + missed speech + speaker confusion) / total reference speech time. DER is the NIST-standard accuracy metric for diarization, and lower is better. Below 10% DER is considered good on most domains. Reference benchmarks: DIHARD III (Third DIHARD Challenge, 2020), AMI meeting corpus, VoxConverse, and CallHome. Methodology and sources below.
Quick Decision Rule:
- • Cheapest auto-diarization → VexaScribe ($2/mo, all plans)
- • Highest benchmark accuracy → Fireflies (92.8%) or Riverside (100% via separate tracks)
- • Voice profiles for recurring speakers → Otter.ai (learns your contacts)
- • Developer API → AssemblyAI ($0.36/hr) or Deepgram ($0.26/hr)
- • Open-source self-hosted → pyannote 3.1 (free, GPU required)
- • Perfect accuracy (no AI errors) → Riverside (separate tracks) or Rev Human ($1.99/min)
- • Very large speaker counts → Fireflies (up to 50 speakers) or Deepgram (16+)
Disclosure: VexaScribe is our product. We recommend it for users who need affordable multi-speaker transcription without tier-gating: diarization is included on all plans starting at $2/mo. We acknowledge that Fireflies has independently benchmarked accuracy (92.8%) while ours has not been independently benchmarked, that Otter.ai has voice profiles we don't offer, that Riverside provides 100% accuracy via separate tracks, and that pyannote.audio and NVIDIA NeMo Sortformer outperform our hosted diarization on academic benchmarks (DIHARD III, AMI) when self-hosted by technical users. Pricing and competitor models were cross-verified against official sites and the latest Hugging Face Open ASR Leaderboard on June 6, 2026.
Key Takeaways
- • Cheapest diarization: VexaScribe — $2/mo with auto-diarization on every plan (no tier-gating)
- • Highest benchmark accuracy: Fireflies.ai — 92.8% overall (7.2% DER) across 500+ hours
- • Perfect accuracy: Riverside — 0% DER via separate track recording (remote calls only)
- • Best voice profiles: Otter.ai — learns recurring speakers, improves over time
- • Best developer API: AssemblyAI ($0.36/hr, 2.9% speaker count error rate)
- • Open-source standard: pyannote 3.1 — free, DER 11–19%, full pipeline control
- • Speaker count matters: 93–97% accuracy with 2 speakers degrades to 70–85% with 15+
Quick Picks by Use Case
| Use Case | Tool | Price | Why |
|---|---|---|---|
| Cheapest auto-diarization | VexaScribe | $2–$20/mo | Diarization on ALL plans starting $2/mo — no tier-gating, 99 languages |
| Highest benchmark accuracy | Fireflies.ai | Free/$10–$39/mo | 92.8% accuracy (7.2% DER) across 500+ hours of testing |
| Voice profiles (learn speakers) | Otter.ai | Free/$8.33–$19.99/mo | Learns your contacts’ voices over time, real-time diarization |
| 100% accuracy (no AI errors) | Riverside | Free/$24–$79/mo | Separate track recording — 0% DER by design |
| Developer API | AssemblyAI | $0.36/hr | 2.9% speaker count error rate, best API accuracy |
| Open-source self-hosted | pyannote 3.1 | Free (GPU req.) | Gold standard, DER 11–19%, full pipeline control |
| Up to 50 speakers in one recording | Fireflies.ai | Free/$10–$39/mo | Supports up to 50 speakers per recording |
| Human-perfect attribution | Rev | $25.49–$47.99/mo | Human transcription option at $1.99/min for critical recordings |
20 tools evaluated across consumer apps (8), developer APIs (8), and open-source (4). Pricing and competitor models verified June 6, 2026.
What Is Speaker Diarization?
Speaker diarization answers the question “who spoke when?” in an audio recording with multiple speakers. It's the technology that labels each segment as Speaker 1, Speaker 2, etc. — turning a single stream of words into a structured multi-speaker transcript.
Diarization (Unsupervised)
Assigns generic labels: Speaker 1, Speaker 2, Speaker 3. Does not know WHO the speakers are — only that they are different people. No prior voice data needed.
Identification (Supervised)
Maps voices to specific known people: “John Smith”, “Sarah Chen”. Requires prior voice enrollment or voice profiles. Otter.ai and OpenAI's API support this.
DER (Diarization Error Rate) — The Standard Metric
DER measures diarization accuracy as a single percentage. Lower is better.
- • False alarm: Silence labeled as speech
- • Missed speech: Speech labeled as silence
- • Speaker confusion: Speech attributed to the wrong speaker
- • Below 10% DER is considered good for production use
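The formula is easy to check by hand. The sketch below is a minimal illustration, not a benchmark tool: the function name and the durations are hypothetical, chosen only so that the three error components add up to a 7.2% DER.

```python
def diarization_error_rate(false_alarm: float, missed: float,
                           confusion: float, total_reference: float) -> float:
    """Return DER as a fraction. All arguments are durations in seconds."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


# Hypothetical recording with 600 s of reference speech (illustrative values).
der = diarization_error_rate(false_alarm=12.0, missed=18.0,
                             confusion=13.2, total_reference=600.0)
print(f"DER = {der:.1%}")  # prints DER = 7.2%
```

Because false alarms are counted against the reference speech time, DER is not capped at 100%: a system that hallucinates a lot of speech can score above it.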
The Diarization Pipeline
Modern diarization systems follow a multi-stage pipeline:
- Audio preprocessing — noise reduction, normalization
- Voice Activity Detection (VAD) — separate speech from silence
- Segmentation — split audio into speaker-homogeneous segments
- Speaker embedding — convert each segment into a voice fingerprint vector
- Clustering — group similar embeddings (= same speaker)
- Labeling — assign Speaker 1, Speaker 2, etc.
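The last three stages (embedding, clustering, labeling) can be sketched with toy data. This is a simplified illustration under stated assumptions: the 3-dimensional "embeddings" are made up, the 0.8 similarity threshold is arbitrary, and the greedy centroid clustering stands in for the agglomerative or spectral clustering that production systems use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def label_segments(embeddings, threshold=0.8):
    """Assign each segment to the most similar existing cluster if the
    similarity reaches the threshold; otherwise open a new cluster."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the matched cluster.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
        labels.append(f"Speaker {best + 1}")
    return labels


# Made-up embeddings for four segments; segments 1 and 3 share a voice.
segments = [(0.9, 0.1, 0.0), (0.1, 0.9, 0.1), (0.85, 0.15, 0.05), (0.0, 0.2, 0.9)]
labels = label_segments(segments)
print(labels)
```

Real speaker embeddings have hundreds of dimensions, and the number of clusters is usually the hardest thing to get right, which is one reason accuracy drops as speaker count grows.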
Diarization Architecture: Clustering vs End-to-End vs Hybrid
Diarization systems fall into three architectural families. Understanding which family a tool belongs to predicts its overlap handling, scalability, and accuracy ceiling.
Clustering-based
Extract speaker embeddings (x-vectors, ECAPA-TDNN) per segment, then cluster similar embeddings using agglomerative hierarchical clustering (AHC) or spectral clustering. The classical approach.
Examples: AWS Transcribe, Google Cloud STT (legacy), older Kaldi recipes.
Weakness: Cannot handle overlapping speech (one speaker per frame assumption).
End-to-End Neural (EEND)
A single neural network maps audio directly to per-speaker activity sequences. Handles overlap natively. NVIDIA Sortformer (2024) is the leading open-source implementation; MSDD (Multi-Scale Diarization Decoder) adds multi-scale clustering.
Examples: NVIDIA NeMo Sortformer, NVIDIA NeMo MSDD, research EEND variants.
Hybrid (Powerset)
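pyannote.audio 3.1+ uses powerset segmentation (Bredin 2023): a single neural network predicts, per frame, which combination of speakers is active, including the empty set and overlap sets. The approach is hybrid because these local, overlap-aware predictions are then stitched together across the whole recording by clustering speaker embeddings.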
Examples: pyannote.audio 3.1+, WhisperX (pyannote-based).
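To make "powerset" concrete, the sketch below enumerates the classes a powerset model would choose between per frame. The configuration of 3 local speakers with at most 2 talking at once is our assumption about a typical setup, not a value taken from this article's sources.

```python
from itertools import combinations


def powerset_classes(num_speakers, max_simultaneous):
    """List every allowed set of active speakers, from silence up to
    max_simultaneous overlapping speakers."""
    classes = []
    for size in range(max_simultaneous + 1):
        classes.extend(combinations(range(1, num_speakers + 1), size))
    return classes


# Assumed configuration: 3 speakers per chunk, at most 2 overlapping.
classes = powerset_classes(num_speakers=3, max_simultaneous=2)
for index, active in enumerate(classes):
    print(index, active if active else "silence")
```

Each frame is then a single multi-class decision over these 7 classes, so an overlap such as speakers 1 and 2 talking together is a class of its own rather than something a one-speaker-per-frame model has to miss.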
Reference papers: pyannote.audio 2.1+ (Bredin 2023), EEND (Fujita et al. 2020), NVIDIA NeMo diarization models.
Standard Benchmarks: DIHARD III, AMI, VoxConverse, CallHome
Four datasets are considered the gold standard for diarization evaluation in 2026. State-of-the-art DER on each is the reference number to compare any tool or research paper against.
| Benchmark | Domain | pyannote 3.1 DER | SOTA DER (2026) | Source |
|---|---|---|---|---|
| DIHARD III | 11 diverse domains — meetings, child speech, clinical, restaurant, courtroom | 21.7% | ~16-22% | DIHARD III challenge |
| AMI (headset) | 100 hours of multi-party business meetings, recorded at Edinburgh, Idiap, and TNO | 18.8% | ~17-22% | AMI corpus |
| VoxConverse | Celebrity interviews and panels from YouTube, ~50 hours | 11.2% | ~5-11% | VoxConverse (Oxford VGG) |
| CallHome (English) | Conversational telephony speech (2-4 speakers), LDC | ~13.0% | ~10-14% | LDC CallHome |
pyannote 3.1 numbers: from Bredin 2023 (arXiv:2304.05300) — the official pyannote.audio 3.1 paper. SOTA ranges reflect the best published results across the academic literature as of June 2026, including NVIDIA Sortformer + MSDD ensembles, Microsoft Whisper-Diarizer variants, and ESPnet diarization recipes.
DER Benchmark Comparison
We compiled DER benchmarks from independent testing and vendor-reported data. Lower DER = better accuracy. Riverside achieves 0% DER by recording each speaker on a separate track (not AI diarization).
| Tool | Overall DER | 2–4 Speakers | 5–8 Speakers | 9–15 Speakers | Noisy Audio | Source |
|---|---|---|---|---|---|---|
| Fireflies.ai | 7.2% | 4.9% | 7.1% | 10.2% | 9.3% | SummarizeMeeting 2026 |
| Notta | 8.5% | 6.8% | 8.9% | 11.1% | 10.9% | SummarizeMeeting 2026 |
| Otter.ai | 10.7% | 7.9% | 10.7% | 14.2% | 15.3% | SummarizeMeeting 2026 |
| pyannote 3.1 | 11–19% | Varies | Varies | Varies | Varies | pyannoteAI benchmark |
| AssemblyAI | ~10% (est.) | N/A | N/A | N/A | N/A | AssemblyAI blog |
| Riverside | 0% | 0% | 0% | 0% | 0% | Separate tracks |
Key insight: Fireflies leads AI-based diarization at 7.2% DER. Riverside's 0% DER is not AI diarization — it records each participant on a separate audio track, eliminating speaker confusion entirely. This only works for remote calls recorded through Riverside.
Overlapping Speech & Real-Time vs Batch
Two technical dimensions matter more for production diarization than raw DER on read speech: how the system handles overlapping speech (the #1 failure mode), and whether it runs batch (offline) or real-time (streaming).
Overlap-Aware vs Clustering-Only Diarization
| Tool / Library | Architecture | Overlap-aware? | Real-time? |
|---|---|---|---|
| pyannote.audio 3.1+ | Hybrid (powerset segmentation) | Yes | Batch only |
| NVIDIA Sortformer | End-to-end neural (EEND) | Yes | Batch only |
| NVIDIA NeMo MSDD | Multi-scale clustering + EEND | Yes | Batch only |
| WhisperX | pyannote-based | Yes (via pyannote) | Batch only |
| diart | Streaming pyannote | Yes | Yes |
| AssemblyAI Universal-2 | Proprietary (likely EEND-based) | Yes | Yes |
| Deepgram Nova-3 | Proprietary | Yes | Yes (lowest latency) |
| Speechmatics Ursa | Proprietary | Yes | Yes |
| Soniox | Proprietary multilingual | Yes | Yes |
| Otter.ai | Proprietary clustering | Limited | Yes |
| AWS Transcribe | Clustering-based | No | Streaming (limited diarization) |
| Google Cloud STT (Chirp 2) | Clustering-based | No | Streaming (limited) |
Overlap-aware vs clustering-only impact: On meeting-like audio with frequent overlap (AMI corpus), overlap-aware systems (pyannote 3.1, NVIDIA Sortformer, AssemblyAI Universal-2) reduce DER by 3-7 percentage points versus clustering-only baselines. This is the single largest accuracy delta in modern diarization — pick an overlap-aware tool for any multi-party meeting or conversational audio.
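A frame-level toy example shows where the overlap penalty comes from. The frames below are invented for illustration, and the labels are assumed to be already mapped between reference and hypothesis. The only point is that a one-speaker-per-frame system is charged missed speech on every overlapped frame.

```python
def frame_der(reference, hypothesis):
    """Frame-level DER. Each frame is the set of speakers active in it."""
    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        correct = len(ref & hyp)
        missed += max(0, len(ref) - len(hyp))
        false_alarm += max(0, len(hyp) - len(ref))
        confusion += min(len(ref), len(hyp)) - correct
        total += len(ref)
    return (missed + false_alarm + confusion) / total


# Ten toy frames: A talks, A and B overlap for two frames, then B talks.
reference = [{"A"}] * 4 + [{"A", "B"}] * 2 + [{"B"}] * 4
overlap_aware = list(reference)                            # idealised output
single_speaker = [{"A"}] * 4 + [{"A"}] * 2 + [{"B"}] * 4   # one label per frame

print(f"overlap-aware:  {frame_der(reference, overlap_aware):.1%}")
print(f"single-speaker: {frame_der(reference, single_speaker):.1%}")
```

In this toy case the single-speaker system scores 2 missed frames out of 12 speaker-frames even though every label it emits is correct. Real gaps depend on how much overlap the audio contains.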
Batch (Offline) Diarization
The algorithm sees the entire audio before deciding speaker labels. Allows global clustering, look-ahead reassignment, and the lowest achievable DER. Use when: processing finished recordings (interviews, podcasts, recorded meetings, depositions).
Examples: pyannote 3.1, NVIDIA Sortformer, WhisperX, AssemblyAI async, Deepgram async, AWS Transcribe.
Real-Time (Streaming) Diarization
Processes audio as it arrives with constant latency. Typically 5-15 percentage points worse DER than batch on the same content. Use when: live captions, live meeting notes, real-time speaker labels for sales calls.
Examples: diart (open-source), Otter.ai live, Deepgram Nova-3 streaming, AssemblyAI streaming, Soniox.
How Speaker Count Affects Accuracy
Diarization accuracy degrades as the number of speakers increases. More speakers means more potential for confusion, especially when voices overlap.
| Speaker Count | Typical Accuracy | Notes |
|---|---|---|
| 2 speakers | 93–97% | Most tools perform well |
| 3–4 speakers | 90–95% | Still reliable for meetings |
| 5–8 speakers | 85–92% | Noticeable degradation begins |
| 9–12 speakers | 80–88% | Significant errors, especially overlapping speech |
| 13–15+ speakers | 70–85% | Only Fireflies (50 max) and APIs handle this reliably |
- • 93–97%: accuracy with 2 speakers on clean audio
- • 85–92%: accuracy with 5–8 speakers
- • 7.2%: best AI DER (Fireflies overall)
- • 0%: DER with separate tracks (Riverside)
Overlapping speech is the #1 failure mode. When two people talk at the same time, most diarization systems either attribute the segment to one speaker (missing the other) or create a false third speaker. Noisy environments compound the problem — Otter's DER jumps from 10.7% overall to 15.3% on noisy audio.
Category A: Consumer SaaS Tools (8 Tools)
These tools are designed for non-technical users — business professionals, journalists, researchers, and teams who need multi-speaker transcription without writing code.
VexaScribe — Cheapest Auto-Diarization (All Plans)
VexaScribe includes auto-diarization on every plan — no tier-gating. The $2/mo Starter plan (200 min) includes the same speaker separation as the $20/mo Business plan (6,000 min). Most competitors gate diarization behind higher tiers: Descript requires Creator+ ($16/mo), and Sonix speaker labels require Premium. At $2/mo, VexaScribe is the cheapest way to get multi-speaker transcription.
99 languages with diarization on all of them. Bulk upload 50 multi-speaker recordings at once — transcribe an entire conference, workshop series, or research interview archive in a single batch. Export with speaker labels to TXT, DOCX, SRT.
Pros:
- ✓ Diarization on ALL plans starting $2/mo — no tier-gating
- ✓ 99 languages with speaker labels
- ✓ Bulk upload 50 multi-speaker files at once
- ✓ AI summaries included
- ✓ Cheapest per-minute with diarization
Cons:
- ✗ No voice profiles (can't learn specific speakers)
- ✗ Accuracy not independently benchmarked
- ✗ No real-time diarization during meetings
- ✗ No mobile app
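For comparison with per-hour API pricing, a subscription plan can be converted into an effective cost per audio hour. The helper below is a small sketch with a hypothetical function name. It uses the Starter and Business plan figures quoted above and assumes every included minute is used.

```python
def cost_per_audio_hour(monthly_price: float, included_minutes: float) -> float:
    """Effective price per hour of audio when the full quota is used."""
    return monthly_price / (included_minutes / 60)


starter = cost_per_audio_hour(2.00, 200)      # $2/mo, 200 min
business = cost_per_audio_hour(20.00, 6000)   # $20/mo, 6,000 min

print(f"Starter:  ${starter:.2f}/hr")
print(f"Business: ${business:.2f}/hr")
```

Unused minutes raise the effective rate, so light users should compare against pay-as-you-go prices before assuming a subscription is cheaper.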
Fireflies.ai — Highest Benchmark Accuracy (92.8%)
Fireflies leads independent benchmarks at 92.8% overall accuracy (7.2% DER) across 500+ hours of testing by SummarizeMeeting in 2026. It handles up to 50 speakers in a single recording, the highest stated limit among the AI tools compared here (AssemblyAI and AWS Transcribe list 30). The meeting bot joins Zoom, Google Meet, and Teams calls automatically.
DER by speaker count: 4.9% with 2–4 speakers, 7.1% with 5–8, 10.2% with 9–15. Noisy audio pushes DER to 9.3% — still the best in class. Free tier includes 800 min/mo storage with limited AI features.
Pros:
- ✓ Highest benchmark accuracy (92.8% / 7.2% DER)
- ✓ Up to 50 speakers per recording
- ✓ Automatic meeting bot for Zoom/Meet/Teams
- ✓ Free tier available
- ✓ AI-powered meeting summaries and action items
Cons:
- ✗ Meeting-focused — less suited for file uploads
- ✗ Free tier has limited AI features
- ✗ No voice profiles for speaker identification
- ✗ Higher cost than VexaScribe ($10–$39/mo vs $2–$20/mo)
Otter.ai — Voice Profiles for Recurring Speakers
Otter's unique advantage is voice profiles: it learns your contacts' voices over time and can label speakers by name, not just “Speaker 1.” This bridges the gap between diarization and identification — after a few meetings, Otter recognizes regular participants automatically.
10.7% overall DER in independent benchmarks — good but behind Fireflies (7.2%) and Notta (8.5%). Real-time diarization during live meetings. Struggles with noisy audio (15.3% DER). 300 min/mo free tier with 30-min per-conversation cap.
Pros:
- ✓ Voice profiles learn recurring speakers
- ✓ Real-time diarization during meetings
- ✓ 300 min/mo free tier
- ✓ Cross-transcript speaker search
Cons:
- ✗ Primarily English — weak multilingual support
- ✗ 15.3% DER on noisy audio
- ✗ 30-min cap on free tier conversations
- ✗ Higher cost than VexaScribe for same features minus voice profiles
Notta — 91.5% Accuracy, 104 Languages
Notta achieves 91.5% accuracy (8.5% DER) in independent benchmarks — second only to Fireflies. 104 languages with diarization, making it the widest language support among consumer tools with verified benchmark data. Especially strong for CJK (Chinese, Japanese, Korean) multi-speaker transcription.
Mobile app (iOS + Android) with recording + diarization. Chrome extension for web meetings. Free tier: 120 min/mo with a 3-min live recording cap — useful for testing but not practical for regular use.
Pros:
- ✓ 91.5% accuracy (8.5% DER) — second-best benchmarked
- ✓ 104 languages with diarization
- ✓ Strong CJK language support
- ✓ Mobile app with recording
Cons:
- ✗ 3-min live recording cap on free tier
- ✗ No voice profiles
- ✗ Higher cost than VexaScribe ($8.25/mo vs $2/mo)
- ✗ 11.1% DER with 9–15 speakers
Descript — Per-Track Speaker Separation
Descript's unique approach: it separates speakers into individual audio tracks that you can edit independently. Delete one speaker's “um”s without affecting others. This is different from labeling — it's actual audio separation. Requires Creator+ plan ($16/mo annual) or higher.
Pros:
- ✓ Per-track speaker separation (edit independently)
- ✓ Text-based editing paradigm
- ✓ Integrated video/podcast editor
Cons:
- ✗ Requires Creator+ ($16/mo) — not on free/Hobbyist
- ✗ DER not publicly benchmarked
- ✗ Expensive for transcription-only use ($1.60–$2.40/hr)
- ✗ No real-time diarization
Riverside — 100% Accuracy via Separate Tracks
Riverside takes a fundamentally different approach: instead of using AI to separate speakers after recording, it records each participant on a separate audio and video track from the start. This means 0% DER by design — there's no AI guessing involved. Each speaker's track is recorded locally on their device for studio quality.
The limitation: this only works for remote calls recorded through Riverside. You can't upload an existing recording and get perfect separation. Limited to 8–10 participants per session.
Pros:
- ✓ 0% DER — perfect speaker separation
- ✓ Studio-quality local recording per speaker
- ✓ Free tier available
- ✓ Built-in video + audio recording
Cons:
- ✗ Only for remote calls through Riverside (can't process existing files)
- ✗ Limited to 8–10 participants
- ✗ No multilingual transcription
- ✗ Expensive for transcription-only ($24–$79/mo)
Sonix — Speaker Labeling with API Access
Sonix offers speaker labeling with auto-detection and manual correction. 49+ languages. The Premium plan adds API access for automated workflows. Pay-as-you-go at $10/hr means no monthly commitment — but it's expensive for heavy use compared to subscription tools.
Pros:
- ✓ No subscription required (PAYG)
- ✓ API access on Premium
- ✓ 49+ languages
- ✓ SOC 2 compliance
Cons:
- ✗ $10/hr is 17–50x more expensive than VexaScribe per hour
- ✗ DER not publicly benchmarked
- ✗ No real-time diarization
- ✗ Speaker labels require Premium tier
Rev — Human Transcription for Perfect Attribution
Rev's unique value: human transcription with trained transcriptionists who identify speakers with near-perfect accuracy. At $1.99/min ($119.40/hr), it's expensive — but for legal depositions, published research, or broadcast media, the accuracy justifies the cost. AI plans ($25.49–$47.99/mo) include automated diarization.
Pros:
- ✓ Human transcription option (~0% error)
- ✓ Unlimited speakers with human service
- ✓ Legal/broadcast compliance
- ✓ Free tier for AI transcription (45 min)
Cons:
- ✗ Human transcription is $119.40/hr — 60–600x more than AI tools
- ✗ AI diarization accuracy not independently benchmarked
- ✗ Slower turnaround for human service (12–24 hrs)
- ✗ Limited language support for human transcription
Category B: Developer APIs (8 Tools)
For developers building speech applications, these APIs provide diarization as a feature within their transcription pipeline. Pricing is per audio hour processed. The 2024-2026 model generation (AssemblyAI Universal-2, Deepgram Nova-3, Speechmatics Ursa, Soniox) leads on accuracy and, in the case of Deepgram and Soniox, on cost.
| API | Price/hr | Diarization Surcharge | Max Speakers | Languages | Voice Profiles |
|---|---|---|---|---|---|
| AssemblyAI Universal-2 | $0.36/hr | Included | 30 | 16–99 | ✓ (add-on) |
| OpenAI | $0.36/hr | Included | Auto | 99+ | ✓ (4 refs) |
| Deepgram Nova-3 | $0.26/hr | Included | 16+ | 36+ | ✗ |
| Speechmatics Ursa | $1.50/hr | Included | Auto | 50+ | ✗ |
| Rev AI (developer) | $1.20/hr | Included | 10+ | 36+ | ✗ |
| Soniox | ~$0.25/hr | Included | Auto | 60+ | ✗ |
| Google Cloud STT | $1.44–$2.16/hr | Extra (enhanced model) | Auto | 125+ | ✗ |
| AWS Transcribe | $1.74–$2.04/hr | Included | 30 | 100+ | ✗ |
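Several vendors publish per-minute rates, so the hourly figures in the table are simple conversions. The sketch below redoes that arithmetic for the per-minute prices quoted in this article. The dictionary and function names are ours, and the Soniox rate is the approximate figure given here.

```python
# Per-minute async list prices as quoted in this article (USD).
PER_MINUTE = {
    "AssemblyAI Universal-2": 0.006,
    "Deepgram Nova-3": 0.0043,
    "Rev AI": 0.02,
    "Soniox": 0.0042,  # approximate
}


def per_hour(price_per_minute: float) -> float:
    """Convert a per-minute price to a per-hour price, rounded to cents."""
    return round(price_per_minute * 60, 2)


hourly = {name: per_hour(rate) for name, rate in PER_MINUTE.items()}
for name, price in hourly.items():
    print(f"{name}: ${price:.2f}/hr")
```

Volume discounts, streaming surcharges, and add-on features are not reflected in these list-price conversions.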
AssemblyAI Universal-2 — Best Accuracy + LLM Features ($0.36/hr)
Universal-2 is AssemblyAI's 2024-generation model. ~7-10% WER on Open ASR Leaderboard composite, with overlap-aware diarization. AssemblyAI reports a 2.9% speaker count error rate — meaning it correctly identifies the number of speakers 97.1% of the time. Supports up to 30 speakers. LeMUR integration adds LLM-powered summarization, sentiment, custom topics, and PII redaction in the same API. Voice profiles available as an add-on for speaker identification.
Latency: ~15–30% of audio duration for async processing. Real-time streaming is available with diarization. Reference: Universal-2 release notes.
Deepgram Nova-3 — Lowest Cost + Fastest Streaming ($0.26/hr)
Nova-3 (late 2024) is Deepgram's state-of-the-art model. ~7-10% WER on Open ASR Leaderboard. At $0.0043/min ($0.26/hr) for async, it's the cheapest hosted diarization API in 2026. Lowest streaming latency in the category — preferred for real-time meeting transcription, contact centers, and live captioning at scale. Supports 16+ speakers per recording, 36+ languages.
Latency: ~10–20% of audio duration async; sub-300ms streaming. Reference: Nova-3 launch notes.
Speechmatics Ursa — Best Accent & Multilingual Diarization ($1.50/hr)
Ursa (2025) is Speechmatics' latest model, known for the strongest accent and dialect robustness in the category — particularly on Indian English, African English, and code-switching audio where most competitors degrade significantly. Includes overlap-aware diarization. 50+ languages with high accuracy across the long tail. Both batch and streaming endpoints. Used by broadcast and media companies for hard-to-transcribe interview content.
Latency: ~20–40% of audio duration async; streaming available. Reference: Speechmatics pricing page.
Rev AI (Developer) — Async + Streaming with Diarization ($1.20/hr)
Rev's developer-facing API, distinct from the consumer Rev product. At $0.02/min ($1.20/hr) async, more expensive than AssemblyAI or Deepgram but with strong English accuracy and an optional upgrade path to Rev's human transcription service ($1.99/min) on the same platform. Useful for production pipelines that need a human-verified fallback for high-stakes audio. Diarization included; 36+ languages.
Latency: ~10–25% of audio duration async; streaming available. Reference: Rev AI pricing page.
Soniox — Real-Time Multilingual with Auto Language Detection (~$0.25/hr)
Soniox specializes in real-time multilingual transcription with automatic language detection mid-stream — useful for code-switching audio (Spanglish, Hinglish) and multilingual meetings. Includes overlap-aware diarization. Per-second pricing model from ~$0.0042/min. 60+ languages. Strong fit for global customer support, multilingual meeting bots, and real-time captioning where the spoken language changes within a single session.
Latency: Sub-500ms streaming. Reference: Soniox pricing page.
OpenAI gpt-4o-transcribe — Newest Entry ($0.36/hr)
OpenAI's gpt-4o-transcribe model added built-in speaker diarization labels. At $0.36/hr, it matches AssemblyAI Universal-2 on price and costs more than Deepgram Nova-3 ($0.26/hr). Supports 99+ languages. Unique feature: it accepts up to 4 reference audio clips for speaker identification, so you can provide sample audio of known speakers and get name-labeled output.
Note: DER not independently benchmarked yet. Early reports suggest competitive accuracy with 2–4 speakers.
AWS Transcribe — Enterprise-Grade ($1.74–$2.04/hr)
AWS Transcribe supports up to 30 speakers with diarization included in the standard pricing. 100+ languages. Integrates with the broader AWS ecosystem (S3, Lambda, SageMaker). Best for enterprises already on AWS who need diarization as part of a larger pipeline. Custom vocabulary and custom language models available.
Pricing: ~$1.44/hr standard, $0.30/hr surcharge for enhanced model with diarization features. Total ~$1.74–$2.04/hr depending on region and model.
Google Cloud Speech-to-Text — 125+ Languages ($1.44–$2.16/hr)
Google Cloud offers diarization through the enhanced model at $1.44–$2.16/hr depending on features and region. Widest language support at 125+. Auto speaker count detection. Integrates with Google Cloud ecosystem (BigQuery, Vertex AI). Speaker diarization requires the enhanced model — the standard model does not support it.
Best for: Enterprises on GCP needing diarization across many languages with cloud-native integration.
Category C: Open-Source (pyannote, NeMo, WhisperX, diart)
pyannote.audio 3.1 — Gold Standard Open-Source Diarization
pyannote.audio 3.1 is the de facto standard for open-source speaker diarization. Its DER typically falls between 11% and 19% on standard benchmarks (11.2% on VoxConverse, 18.8% on AMI headset) and rises to 21.7% on the harder DIHARD III set. It is language-agnostic and works on any language without language-specific models. It is commonly paired with OpenAI Whisper for a complete open-source transcription + diarization pipeline.
GPU required for practical use (CPU inference is 10–50x slower). The commercial pyannoteAI service (€19–€99/mo) offers a 28% DER improvement over the open-source version with proprietary model weights.
Pros:
- ✓ Free and open-source (MIT license)
- ✓ Full pipeline control — customize every stage
- ✓ Language-agnostic (works on any language)
- ✓ Configurable max speakers
- ✓ Active community and research papers
- ✓ Pairs with Whisper for end-to-end pipeline
Cons:
- ✗ GPU required (NVIDIA recommended, 4GB+ VRAM)
- ✗ DER 11–19%, worse than the best commercial tools (7.2%), although those figures come from different test sets
- ✗ Requires Python development skills
- ✗ No UI — command-line/code only
- ✗ Self-hosted infrastructure costs
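For readers who want to try it, a minimal sketch of running the pretrained pipeline follows. It uses the call names from pyannote.audio 3.1 as we understand them (`Pipeline.from_pretrained`, `itertracks`). Newer releases may name the token argument differently. The audio path and Hugging Face token are placeholders, and `format_turn` is our own helper. The pyannote import is deferred so the formatting helper runs even without the library installed.

```python
def format_turn(start: float, end: float, speaker: str) -> str:
    """Render one speaker turn as a fixed-width text line."""
    return f"{start:7.2f}s - {end:7.2f}s  {speaker}"


def diarize(audio_path: str, hf_token: str) -> list[str]:
    """Run the pretrained pipeline and return one line per speaker turn.

    Requires `pip install pyannote.audio` and a Hugging Face token that
    has been granted access to the gated model.
    """
    from pyannote.audio import Pipeline  # deferred: heavy optional dependency

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,  # argument name as of pyannote.audio 3.1
    )
    diarization = pipeline(audio_path)
    return [
        format_turn(turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]


# Formatting demo only; calling diarize() needs the model and a real file.
print(format_turn(0.5, 3.25, "SPEAKER_00"))
```

A call such as `diarize("meeting.wav", "hf_...")` would return lines like the demo output, with generic `SPEAKER_00`, `SPEAKER_01` labels rather than names.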
Open-Source vs. Commercial Accuracy Gap
pyannoteAI (commercial) achieves 28% lower DER than pyannote 3.1 (open-source) on the same benchmarks. If you need the best open-source accuracy without paying for the commercial version, fine-tuning on your specific domain data can close most of the gap.
NVIDIA NeMo: Sortformer & MSDD — End-to-End Neural Diarization
NVIDIA NeMo's diarization stack covers two architectures: Sortformer (2024) is an end-to-end neural diarizer that produces speaker activity sequences directly from audio, handling overlap natively. MSDD (Multi-Scale Diarization Decoder) combines neural segmentation with multi-scale clustering for longer recordings. Both are competitive with pyannote 3.1 on AMI and DIHARD III benchmarks, especially in overlap-heavy conditions.
When to pick over pyannote: You already use NeMo for ASR (Canary, Parakeet, Conformer), you have NVIDIA hardware, or your audio has heavy overlap (meetings, panels). NeMo's Apache 2.0 license is comparably permissive to pyannote's MIT license and adds an explicit patent grant, which matters in some downstream packaging scenarios.
WhisperX — Whisper + pyannote + Forced Alignment
WhisperX wraps three open-source components into one CLI: OpenAI Whisper Large-v3 for transcription, pyannote.audio 3.1 for diarization (overlap-aware via powerset), and wav2vec2-based forced alignment for word-level timestamps. The single most popular open-source "transcription + speaker labels" combo in 2026.
When to pick: You want both a transcript and speaker labels in one batch command without writing pipeline glue code. Inherits pyannote 3.1's 11-19% DER on standard benchmarks plus Whisper's 2.7% WER on LibriSpeech.
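The step that turns a transcript plus diarization into a speaker-labeled transcript is a timestamp join. The sketch below illustrates the idea with made-up words and turns. It is a generic maximum-overlap assignment, not WhisperX's actual implementation.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of (text, start, end); turns: list of (speaker, start, end).
    Words that overlap no turn are labeled None.
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


# Hypothetical word timestamps (from alignment) and speaker turns (from diarization).
words = [("Hello", 0.0, 0.4), ("there", 0.5, 0.9), ("Hi", 1.1, 1.3)]
turns = [("SPEAKER_00", 0.0, 1.0), ("SPEAKER_01", 1.0, 2.0)]
labeled = assign_speakers(words, turns)
print(labeled)
```

This is why word-level timestamps matter: with only segment-level timestamps, a segment that spans a speaker change gets a single label and part of it is attributed to the wrong person.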
diart — Real-Time Streaming Diarization
diart is the standard open-source streaming diarization toolkit. Built on pyannote models, it processes audio in chunks with constant latency, suitable for live transcription and real-time speaker labels. Pairs naturally with streaming Whisper variants (whisper-streaming, faster-whisper) for full live transcription + diarization.
When to pick: You need real-time diarization (live captions, meeting bots, accessibility tools) and don't want to pay per-minute API costs at scale. Expect 5-15 percentage points worse DER than offline pyannote due to streaming constraints.
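Streaming diarizers work on a rolling buffer rather than the whole file. The sketch below shows only the chunking logic. The 5-second window and 0.5-second step are illustrative assumptions, not confirmed diart defaults.

```python
def sliding_windows(total_seconds: float, window: float = 5.0, step: float = 0.5):
    """Return (start, end) times of every full window that fits in the audio."""
    if total_seconds < window:
        return []
    count = int((total_seconds - window) // step) + 1
    return [(i * step, i * step + window) for i in range(count)]


windows = sliding_windows(7.0)
print(windows)
```

Each new window adds only `step` seconds of fresh audio, so the system must decide speaker labels with little look-ahead. That constraint is the main source of the accuracy gap versus batch processing.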
Other Open-Source Options:
- • Kaldi: Legacy toolkit; mature x-vector + AHC pipelines, but largely superseded by pyannote and NeMo for new projects.
- • SpeechBrain: PyTorch-based all-in-one speech toolkit; includes diarization recipes but smaller community than pyannote/NeMo.
- • ESPnet: Research-focused speech toolkit; has EEND and target-speaker EEND recipes used in academic papers.
- • Picovoice Falcon: On-device diarization for edge applications; limited to 2 speakers, narrow scope.
Full Comparison Table
| Tool | Category | Cost/hr | Max Speakers | Languages | DER (approx) | Voice Profiles |
|---|---|---|---|---|---|---|
| VexaScribe | Consumer | $0.20–$0.60 | Auto | 99 | Not benchmarked | ✗ |
| Fireflies.ai | Consumer | $0.60–$1.08 | 50 | 100+ | ~7.2% | ✗ |
| Otter.ai | Consumer | $0.42–$0.85 | Auto | 30+ | ~10.7% | ✓ |
| Notta | Consumer | $0.50–$0.93 | Auto | 104 | ~8.5% | ✗ |
| Descript | Consumer | $1.60–$2.40 | 8+ | 30+ | Not published | ✗ |
| Riverside | Consumer | $0.96–$1.16 | 8–10 | N/A | 0% | N/A |
| Sonix | Consumer | $5–$10/hr | Auto | 49+ | Not published | ✗ |
| Rev Human | Consumer | $119.40/hr | Unlimited | 15+ | ~0% | Human |
| AssemblyAI | API | $0.36/hr | 30 | 16–99 | ~10% (est.) | ✓ (add-on) |
| Deepgram | API | $0.26/hr | 16+ | 36+ | Not published | ✗ |
| OpenAI | API | $0.36/hr | Auto | 99+ | Not published | ✓ (4 refs) |
| AWS Transcribe | API | $1.74–$2.04/hr | 30 | 100+ | Not published | ✗ |
| Google Cloud | API | $1.44–$2.16/hr | Auto | 125+ | Not published | ✗ |
| pyannote 3.1 | Open-source | Free (GPU costs) | Configurable | Agnostic | 11–19% | ✗ |
Legend: ✓ = Supported | ✗ = Not supported. Cost/hr calculated from cheapest plan with diarization. All pricing verified June 6, 2026.
How We Tested Speaker Diarization Tools
We evaluated each tool on diarization-specific criteria. DER benchmarks come from independent testing (SummarizeMeeting 2026) and vendor-reported data where independent results were unavailable. See our multi-speaker transcription comparison for additional testing methodology.
Test Recordings:
| Test | Duration | Details |
|---|---|---|
| 2-Speaker Interview | 42 min | Clear audio, minimal overlap, recorded on Zoom |
| 5-Speaker Meeting | 58 min | Team standup with frequent turn-taking, Google Meet |
| 8-Speaker Panel | 90 min | Conference panel with overlapping speech, audience noise |
| Noisy Environment | 30 min | 3 speakers in a café with background noise |
What We Measured:
- • DER (Diarization Error Rate) — false alarms + missed speech + speaker confusion
- • Speaker count accuracy — how often the tool correctly identifies the number of speakers
- • Overlap handling — behavior when two speakers talk simultaneously
- • Latency — time from upload to completed diarization
- • Speaker label consistency — does Speaker 1 stay Speaker 1 throughout the recording?
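Scoring requires matching a tool's generic labels to the true speakers first, because "Speaker 1" in the output may be the second person in the reference. The sketch below finds the best one-to-one mapping by brute force over per-frame labels. The data is invented, and the function is our own illustration rather than the scoring code used for the figures in this article. Scoring toolkits typically use an optimal assignment algorithm instead of trying every permutation.

```python
from itertools import permutations


def best_label_mapping(reference, hypothesis):
    """Find the hypothesis-to-reference label mapping with the most
    agreeing frames. Returns (mapping, fraction of frames that agree)."""
    ref_labels = sorted(set(reference))
    hyp_labels = sorted(set(hypothesis))
    # Pad with None so extra hypothesis speakers can stay unmatched.
    padded = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))
    best_map, best_hits = {}, -1
    for perm in permutations(padded, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        hits = sum(mapping[h] == r for r, h in zip(reference, hypothesis))
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return best_map, best_hits / len(reference)


# Per-frame labels for a toy 6-frame recording.
reference = ["Ana", "Ana", "Ben", "Ben", "Ana", "Ben"]
hypothesis = ["S2", "S2", "S1", "S1", "S1", "S1"]
mapping, agreement = best_label_mapping(reference, hypothesis)
print(mapping, f"{agreement:.0%}")
```

Here the fifth frame is the only disagreement: the tool kept labeling "S1" while the true speaker had switched back to Ana, which is exactly the kind of drift the consistency check is meant to catch.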
Pricing sources: Each tool's official pricing page, verified June 6, 2026. API pricing reflects standard tier without volume discounts. Benchmark sources include the Hugging Face Open ASR Leaderboard, Bredin 2023 (pyannote 3.1, arXiv:2304.05300), DIHARD III challenge results, AMI corpus published baselines, and VoxConverse benchmark numbers.
Frequently Asked Questions
What is speaker diarization?
Speaker diarization is AI technology that identifies “who spoke when” in audio with multiple speakers. It labels each segment as Speaker 1, Speaker 2, etc. Unlike plain transcription, which produces a single undifferentiated stream of words, diarization attributes each segment to the speaker who said it.
How accurate is speaker diarization?
90–97% with 2–4 speakers on clean audio, degrading to 85–92% with 5–8 speakers. Fireflies leads benchmarks at 92.8% overall accuracy (7.2% DER). Overlapping speech is the biggest challenge for all tools.
What’s the difference between diarization and speaker identification?
Diarization assigns generic labels (Speaker 1, Speaker 2) without knowing who the speakers are — it’s unsupervised. Identification matches voices to specific known people (requires prior voice enrollment). Otter.ai and OpenAI’s API support identification via voice profiles.
How many speakers can diarization handle?
Most consumer tools handle 2–10 reliably. Fireflies claims up to 50. AssemblyAI API supports up to 30. Accuracy decreases as speaker count increases — expect 85–90% with 5–8 speakers and 80–85% with 9–15.
What is DER (Diarization Error Rate)?
DER is the standard accuracy metric for speaker diarization. It measures false alarms + missed speech + speaker confusion as a percentage of total speech duration. Below 10% is considered good. Fireflies achieves 7.2% DER, Notta 8.5%, and Otter 10.7%.
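As an illustration of the formula, the sketch below computes DER at the frame level for a toy recording. It is a simplified version under stated assumptions: one label per frame (no overlapping speech), no forgiveness collar around speaker changes, and a brute-force search for the best label mapping that only scales to a handful of speakers. The function name and the toy labels are ours. Established scoring tools such as NIST's md-eval script and pyannote.metrics handle these details properly.

```python
from itertools import permutations


def frame_level_der(reference, hypothesis):
    """Return DER and its components for two equal-length per-frame label lists.

    Each entry is a speaker label, or None for non-speech.
    Simplification: at most one speaker per frame (no overlap, no collar).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    pairs = list(zip(reference, hypothesis))
    total_speech = sum(1 for r, _ in pairs if r is not None)
    if total_speech == 0:
        raise ValueError("reference contains no speech")

    missed = sum(1 for r, h in pairs if r is not None and h is None)
    false_alarm = sum(1 for r, h in pairs if r is None and h is not None)
    both_speech = [(r, h) for r, h in pairs if r is not None and h is not None]

    # Hypothesis labels are arbitrary ("S1" vs "A"), so score against the
    # one-to-one label mapping that matches the most frames.
    ref_ids = sorted({r for r, _ in both_speech})
    hyp_ids = sorted({h for _, h in both_speech})
    padded = ref_ids + [None] * max(0, len(hyp_ids) - len(ref_ids))
    best_matched = 0
    for perm in permutations(padded, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        matched = sum(1 for r, h in both_speech if mapping[h] == r)
        best_matched = max(best_matched, matched)

    confusion = len(both_speech) - best_matched
    der = (false_alarm + missed + confusion) / total_speech
    return {
        "false_alarm": false_alarm,
        "missed": missed,
        "confusion": confusion,
        "total_speech": total_speech,
        "der": der,
    }


# Toy example: 8 frames, two true speakers (A, B), hypothetical system output.
reference = ["A", "A", "A", "B", "B", None, "B", "B"]
hypothesis = ["S1", "S1", "S2", "S2", "S2", "S2", None, "S2"]
result = frame_level_der(reference, hypothesis)
print(result)
```

In this toy case there is one false-alarm frame, one missed frame, and one confused frame against seven frames of reference speech, so DER is 3/7, or about 42.9%. The accuracy figures quoted in this article are simply 100% minus DER.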
Does diarization work with overlapping speech?
Poorly. Overlapping speech is the #1 failure mode for all diarization systems. Modern tools are improving but still struggle when two people talk simultaneously. Riverside avoids the problem entirely by recording each speaker on a separate audio track.
Which is the cheapest tool with speaker diarization?
Among paid plans, VexaScribe at $2/month is the cheapest and includes auto-diarization on all plans, with no tier-gating. Free tiers exist but cap usage: Fireflies free (limited minutes) and Otter free (300 min/mo). For APIs, AssemblyAI at $0.17/hr is the cheapest developer option with diarization.
Can I get 100% accurate speaker separation?
Yes — record each speaker on a separate audio track. Riverside does this automatically for remote calls. Alternatively, use Rev’s human transcription ($1.99/min) for near-perfect speaker attribution by trained transcriptionists.
What is the standard benchmark for speaker diarization?
Three benchmarks are considered the gold standard for diarization in 2026. (1) DIHARD III (Third DIHARD Speech Diarization Challenge, 2020) — the toughest benchmark, covering 11 diverse domains including child speech and clinical conversations. State-of-the-art DER is ~16-22%. (2) AMI (Augmented Multi-party Interaction) meeting corpus — 100 hours of multi-speaker meeting audio. SOTA DER ~17-22% on headset condition. (3) VoxConverse — celebrity interviews and panels from YouTube, ~50 hours. SOTA DER ~5-11%. pyannote.audio 3.1 reports 21.7% DER on DIHARD III, 18.8% on AMI, and 11.2% on VoxConverse (Bredin 2023). CallHome is also widely used for conversational telephony.
What are the best open-source diarization tools in 2026?
Four leading open-source options. (1) pyannote.audio 3.1+ — the gold standard, includes powerset segmentation for overlap handling. Used by VexaScribe, WhisperX, and many commercial tools under the hood. (2) NVIDIA NeMo’s Sortformer (end-to-end overlap-aware) and MSDD (Multi-Scale Diarization Decoder) — competitive with pyannote on AMI, GPU-friendly. (3) WhisperX — combines Whisper Large-v3 transcription with pyannote diarization and forced alignment in one pipeline. Most popular combo for self-hosted transcription with speaker labels. (4) diart — real-time streaming diarization built on pyannote, the standard for live diarization.
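The core idea behind combining a transcription model with a diarization model can be sketched in a few lines: each timestamped word is given to the speaker turn it overlaps most. This is a simplified illustration of that merging step, not WhisperX's actual code. The tuple layouts, function names, and sample data are assumptions made for the example.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of (text, start, end) from a transcription model.
    turns: list of (speaker, start, end) from a diarization model.
    Words that overlap no turn get the speaker None.
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            shared = overlap_seconds(w_start, w_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((text, best_speaker))
    return labeled


# Hypothetical outputs from the two models (times in seconds).
words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.1, 1.3)]
turns = [("SPEAKER_00", 0.0, 1.0), ("SPEAKER_01", 1.0, 2.0)]

labeled = assign_speakers(words, turns)
print(labeled)
```

The quality of the result depends on accurate word timestamps, which is why pipelines of this kind add a forced-alignment stage between transcription and speaker assignment.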
How does overlapping speech affect diarization accuracy?
Overlapping speech is the #1 challenge for diarization. Traditional clustering-based systems (x-vector + agglomerative hierarchical clustering) cannot handle overlap — they assume one speaker per frame. Modern overlap-aware systems use either powerset encoding (pyannote 3.1+) or end-to-end neural diarization (EEND, NVIDIA Sortformer) to assign multiple speakers to the same frame. On real meeting audio (AMI), overlap-aware systems reduce DER by 3-7 percentage points vs clustering-only baselines. Commercial APIs vary: AssemblyAI Universal-2 and Deepgram Nova-3 have overlap handling; most others (including AWS Transcribe, Google Cloud STT) are clustering-based and degrade on overlap.
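Powerset encoding can be illustrated with a short sketch. Instead of predicting one speaker per frame, the model picks one class per frame from the set of all allowed speaker combinations, so "speakers 0 and 1 together" becomes a class of its own. The setting below (3 speakers per chunk, at most 2 talking at once) is an illustrative assumption, and the function names are ours.

```python
from itertools import combinations


def powerset_classes(num_speakers, max_simultaneous):
    """List every allowed set of active speakers, starting with non-speech."""
    classes = [()]  # empty tuple = nobody is speaking
    for size in range(1, max_simultaneous + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def decode_frame(class_index, classes, num_speakers):
    """Turn a predicted class index into per-speaker on/off activity."""
    active = classes[class_index]
    return [1 if speaker in active else 0 for speaker in range(num_speakers)]


classes = powerset_classes(num_speakers=3, max_simultaneous=2)
print(len(classes))  # 1 non-speech + 3 single speakers + 3 pairs
print(classes)

# Class 4 is the pair (0, 1): two speakers active in the same frame.
activity = decode_frame(4, classes, num_speakers=3)
print(activity)
```

A clustering-only system has no equivalent of the pair classes, which is why it must assign an overlapped frame to a single speaker and absorb the rest as missed speech or speaker confusion.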
What is the difference between batch and real-time diarization?
Batch diarization processes a complete recording offline: the algorithm sees the entire audio before deciding speaker labels, which allows global clustering and reassignment. Most diarization tools (pyannote, NeMo Sortformer, AWS Transcribe, Google STT, AssemblyAI async) are batch-only. Real-time (streaming) diarization processes audio as it arrives, under a fixed latency budget and with no future context. This is significantly harder: typical streaming DER is 5-15 percentage points worse than the batch equivalent. The leading real-time options in 2026: diart (open-source, pyannote-based), Otter.ai (consumer real-time), Deepgram Nova-3 streaming, and Soniox.
Ready to Transcribe Multi-Speaker Audio?
Start with 30 free minutes. Auto-diarization included on every plan — no tier-gating. From $2/mo.