By NovaScribe Editorial · Pricing verified April 2026

Best Speaker Diarization Tools in 2026 (Apps + APIs)

We compared 14 speaker diarization tools across three categories: consumer apps for non-technical users, developer APIs for building speech applications, and open-source libraries for self-hosted solutions. Fireflies.ai has the highest benchmark accuracy at 92.8% (7.2% DER) across 500+ hours of testing. NovaScribe is the cheapest paid option, with auto-diarization on every plan from $2/month. Riverside avoids diarization errors altogether by recording each speaker on a separate track, which works only for remote calls recorded in Riverside. For developers, AssemblyAI is our pick among diarization APIs: $0.17/hr with a vendor-reported 2.9% speaker count error rate.

Quick Decision Rule:

  • Cheapest auto-diarization → NovaScribe ($2/mo, all plans)
  • Highest benchmark accuracy → Fireflies (92.8%) or Riverside (100% via separate tracks)
  • Voice profiles for recurring speakers → Otter.ai (learns your contacts)
  • Developer API → AssemblyAI ($0.17/hr) or Deepgram ($0.58/hr)
  • Open-source self-hosted → pyannote 3.1 (free, GPU required)
  • Perfect accuracy (no AI errors) → Riverside (separate tracks) or Rev Human ($1.99/min)
  • Large groups (up to 50 speakers) in one recording → Fireflies (50 max) or Deepgram (16+)

Disclosure: NovaScribe is our product. We recommend it for users who need affordable multi-speaker transcription without tier-gating: diarization is included on all plans starting at $2/mo. We acknowledge that Fireflies has a benchmarked accuracy of 92.8% while NovaScribe's accuracy has not been independently benchmarked, that Otter.ai has voice profiles we don't offer, and that Riverside provides 100% accuracy via separate tracks. Pricing verified on official sites April 2, 2026.

Key Takeaways

  • Cheapest diarization: NovaScribe — $2/mo with auto-diarization on every plan (no tier-gating)
  • Highest benchmark accuracy: Fireflies.ai — 92.8% overall (7.2% DER) across 500+ hours
  • Perfect accuracy: Riverside — 0% DER via separate track recording (remote calls only)
  • Best voice profiles: Otter.ai — learns recurring speakers, improves over time
  • Best developer API: AssemblyAI — $0.17/hr, 2.9% speaker count error rate
  • Open-source standard: pyannote 3.1 — free, DER 11–19%, full pipeline control
  • Speaker count matters: 93–97% accuracy with 2 speakers degrades to 70–85% with 15+

Quick Picks by Use Case

Use Case | Tool | Price | Why
Cheapest auto-diarization | NovaScribe | $2–$20/mo | Diarization on all plans starting at $2/mo, no tier-gating, 99 languages
Highest benchmark accuracy | Fireflies.ai | Free/$10–$39/mo | 92.8% accuracy (7.2% DER) across 500+ hours of testing
Voice profiles (learn speakers) | Otter.ai | Free/$8.33–$19.99/mo | Learns your contacts’ voices over time, real-time diarization
100% accuracy (no AI errors) | Riverside | Free/$24–$79/mo | Separate track recording, 0% DER by design
Developer API | AssemblyAI | $0.17/hr | 2.9% speaker count error rate, best API accuracy
Open-source self-hosted | pyannote 3.1 | Free (GPU req.) | Gold standard, DER 11–19%, full pipeline control
Large groups (up to 50 speakers) | Fireflies.ai | Free/$10–$39/mo | Supports up to 50 speakers per recording
Human-perfect attribution | Rev | $25.49–$47.99/mo | Human transcription option at $1.99/min for critical recordings

14 tools evaluated across consumer apps, developer APIs, and open-source. Pricing verified April 2026.

What Is Speaker Diarization?

Speaker diarization answers the question “who spoke when?” in an audio recording with multiple speakers. It's the technology that labels each segment as Speaker 1, Speaker 2, etc. — turning a single stream of words into a structured multi-speaker transcript.

Diarization (Unsupervised)

Assigns generic labels: Speaker 1, Speaker 2, Speaker 3. Does not know WHO the speakers are — only that they are different people. No prior voice data needed.

Identification (Supervised)

Maps voices to specific known people: “John Smith”, “Sarah Chen”. Requires prior voice enrollment or voice profiles. Otter.ai and OpenAI's API support this.

DER (Diarization Error Rate) — The Standard Metric

DER measures diarization accuracy as a single percentage. Lower is better.

DER = (False Alarm + Missed Speech + Speaker Confusion) ÷ Total Speech Duration
  • False alarm: Silence labeled as speech
  • Missed speech: Speech labeled as silence
  • Speaker confusion: Speech attributed to the wrong speaker
  • Below 10% DER is considered good for production use
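
To make the formula concrete, here is a minimal worked example in Python. The durations are invented for illustration (they are not from any benchmark), and the function name is our own. The numbers are chosen so the result matches the 7.2% DER / 92.8% accuracy pairing used throughout this article, which treats accuracy as 100% minus DER.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Return DER as a fraction. All arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


# Hypothetical recording with 600 s of reference speech.
der = diarization_error_rate(
    false_alarm=12.0,   # silence labeled as speech
    missed=18.0,        # speech labeled as silence
    confusion=13.2,     # speech attributed to the wrong speaker
    total_speech=600.0,
)
print(f"DER: {der:.1%}")            # DER: 7.2%
print(f"Accuracy: {1 - der:.1%}")   # Accuracy: 92.8%
```

Because false alarms are counted against the reference speech duration, DER is not capped at 100%, so the "accuracy = 100% minus DER" shorthand is only meaningful when errors are small.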

The Diarization Pipeline

Modern diarization systems follow a multi-stage pipeline:

  1. Audio preprocessing — noise reduction, normalization
  2. Voice Activity Detection (VAD) — separate speech from silence
  3. Segmentation — split audio into speaker-homogeneous segments
  4. Speaker embedding — convert each segment into a voice fingerprint vector
  5. Clustering — group similar embeddings (= same speaker)
  6. Labeling — assign Speaker 1, Speaker 2, etc.
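
Steps 5 and 6 are the easiest to illustrate in a few lines. The sketch below is a toy, not how any tool in this comparison actually works: real systems use neural speaker embeddings with hundreds of dimensions and more robust clustering (agglomerative or spectral, for example). The 3-dimensional vectors, the 0.8 threshold, and the function names are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar cluster or start a new one."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            # Update the running mean of the matched cluster.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] += 1
        else:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        labels.append(f"Speaker {best + 1}")  # step 6: generic labels
    return labels


# Hypothetical embeddings for four consecutive speech segments.
segments = [(0.9, 0.1, 0.0), (0.1, 0.9, 0.1), (0.85, 0.15, 0.05), (0.0, 0.2, 0.95)]
labels = cluster_embeddings(segments)
print(labels)  # ['Speaker 1', 'Speaker 2', 'Speaker 1', 'Speaker 3']
```

The threshold is the main tuning knob in this sketch: set it too high and one person is split into several speakers, too low and different people are merged.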

DER Benchmark Comparison

We compiled DER benchmarks from independent testing and vendor-reported data. Lower DER = better accuracy. Riverside achieves 0% DER by recording each speaker on a separate track (not AI diarization).

Tool | Overall DER | 2–4 Speakers | 5–8 Speakers | 9–15 Speakers | Noisy Audio | Source
Fireflies.ai | 7.2% | 4.9% | 7.1% | 10.2% | 9.3% | SummarizeMeeting 2026
Notta | 8.5% | 6.8% | 8.9% | 11.1% | 10.9% | SummarizeMeeting 2026
Otter.ai | 10.7% | 7.9% | 10.7% | 14.2% | 15.3% | SummarizeMeeting 2026
pyannote 3.1 | 11–19% | Varies | Varies | Varies | Varies | pyannoteAI benchmark
AssemblyAI | ~10% (est.) | N/A | N/A | N/A | N/A | AssemblyAI blog
Riverside | 0% | 0% | 0% | 0% | 0% | Separate tracks

Key insight: Fireflies leads AI-based diarization at 7.2% DER. Riverside's 0% DER is not AI diarization — it records each participant on a separate audio track, eliminating speaker confusion entirely. This only works for remote calls recorded through Riverside.

How Speaker Count Affects Accuracy

Diarization accuracy degrades as the number of speakers increases. More speakers means more potential for confusion, especially when voices overlap.

Speaker Count | Typical Accuracy | Notes
2 speakers | 93–97% | Most tools perform well
3–4 speakers | 90–95% | Still reliable for meetings
5–8 speakers | 85–92% | Noticeable degradation begins
9–12 speakers | 80–88% | Significant errors, especially overlapping speech
13–15+ speakers | 70–85% | Only Fireflies (50 max) and APIs handle this reliably

  • 93–97%: accuracy with 2 speakers on clean audio
  • 85–92%: accuracy with 5–8 speakers
  • 7.2%: best AI DER (Fireflies overall)
  • 0%: DER with separate tracks (Riverside)

Overlapping speech is the #1 failure mode. When two people talk at the same time, most diarization systems either attribute the segment to one speaker (missing the other) or create a false third speaker. Noisy environments compound the problem — Otter's DER jumps from 10.7% overall to 15.3% on noisy audio.
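
If you have speaker turns with timestamps (from a reference annotation or a tool's output), you can measure how much of a recording is overlapped speech before deciding how much to trust the labels. The sketch below is one simple way to do that with a sweep over turn boundaries. The turn format and the example times are hypothetical, and it assumes a single speaker's own turns never overlap each other.

```python
def overlapped_seconds(turns):
    """Total time during which two or more speakers are active.

    `turns` is a list of (speaker, start_s, end_s) tuples.
    """
    events = []
    for _speaker, start, end in turns:
        events.append((start, 1))    # a speaker starts talking
        events.append((end, -1))     # a speaker stops talking
    # At equal timestamps, -1 sorts before +1, so turns that merely
    # touch (one ends exactly when the next begins) are not counted.
    events.sort()

    active, prev, total = 0, None, 0.0
    for t, delta in events:
        if active >= 2:
            total += t - prev
        active += delta
        prev = t
    return total


# Hypothetical 20-second clip with two short interruptions.
turns = [("A", 0.0, 10.0), ("B", 8.0, 15.0), ("C", 14.0, 20.0)]
overlap = overlapped_seconds(turns)
print(f"{overlap:.1f} s overlapped ({overlap / 20.0:.0%} of the clip)")
```

Here A and B overlap from 8 s to 10 s and B and C from 14 s to 15 s, so 3 of the 20 seconds contain simultaneous speech.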

Category A: Consumer SaaS Tools (8 Tools)

These tools are designed for non-technical users — business professionals, journalists, researchers, and teams who need multi-speaker transcription without writing code.

NovaScribe — Cheapest Auto-Diarization (All Plans)

Best for: Affordable multi-speaker transcription
Price: $2–$20/mo
Languages: 99 | Max speakers: Auto-detect
Pricing source: novascribe.ai/pricing (verified Apr 2, 2026)

NovaScribe includes auto-diarization on every plan — no tier-gating. The $2/mo Starter plan (200 min) includes the same speaker separation as the $20/mo Business plan (6,000 min). Most competitors gate diarization behind higher tiers: Descript requires Creator+ ($16/mo), and Sonix speaker labels require Premium. At $2/mo, NovaScribe is the cheapest way to get multi-speaker transcription.

99 languages with diarization on all of them. Bulk upload 50 multi-speaker recordings at once — transcribe an entire conference, workshop series, or research interview archive in a single batch. Export with speaker labels to TXT, DOCX, SRT.

Pros:

  • ✓ Diarization on ALL plans starting $2/mo — no tier-gating
  • ✓ 99 languages with speaker labels
  • ✓ Bulk upload 50 multi-speaker files at once
  • ✓ AI summaries included
  • ✓ Cheapest per-minute with diarization

Cons:

  • ✗ No voice profiles (can't learn specific speakers)
  • ✗ Accuracy not independently benchmarked
  • ✗ No real-time diarization during meetings
  • ✗ No mobile app
Choose if: You need affordable multi-speaker transcription and don't need voice enrollment or real-time diarization. Best value for batch processing multi-speaker recordings.
Try NovaScribe free (30 minutes) →

Fireflies.ai — Highest Benchmark Accuracy (92.8%)

Best for: Teams needing the most accurate diarization
Price: Free / $10–$39/mo
Languages: 100+ | Max speakers: 50
Pricing source: fireflies.ai/pricing (verified Apr 2, 2026)

Fireflies leads independent benchmarks at 92.8% overall accuracy (7.2% DER) across 500+ hours of testing by SummarizeMeeting in 2026. It handles up to 50 speakers in a single recording, the highest stated limit among the AI-based tools we compared (the next highest, AssemblyAI and AWS Transcribe, stop at 30). The meeting bot joins Zoom, Google Meet, and Teams calls automatically.

DER by speaker count: 4.9% with 2–4 speakers, 7.1% with 5–8, 10.2% with 9–15. Noisy audio pushes DER to 9.3% — still the best in class. Free tier includes 800 min/mo storage with limited AI features.

Pros:

  • ✓ Highest benchmark accuracy (92.8% / 7.2% DER)
  • ✓ Up to 50 speakers per recording
  • ✓ Automatic meeting bot for Zoom/Meet/Teams
  • ✓ Free tier available
  • ✓ AI-powered meeting summaries and action items

Cons:

  • ✗ Meeting-focused — less suited for file uploads
  • ✗ Free tier has limited AI features
  • ✗ No voice profiles for speaker identification
  • ✗ Higher cost than NovaScribe ($10–$39/mo vs $2–$20/mo)
Choose if: Diarization accuracy is your top priority, especially for meetings with 5+ speakers. Best benchmark results across all speaker counts.

Otter.ai — Voice Profiles for Recurring Speakers

Best for: Teams with recurring meeting participants
Price: Free / $8.33–$19.99/mo (annual)
Languages: 30+ | Max speakers: Auto
Pricing source: otter.ai/pricing (verified Apr 2, 2026)

Otter's unique advantage is voice profiles: it learns your contacts' voices over time and can label speakers by name, not just “Speaker 1.” This bridges the gap between diarization and identification — after a few meetings, Otter recognizes regular participants automatically.

10.7% overall DER in independent benchmarks — good but behind Fireflies (7.2%) and Notta (8.5%). Real-time diarization during live meetings. Struggles with noisy audio (15.3% DER). 300 min/mo free tier with 30-min per-conversation cap.

Pros:

  • ✓ Voice profiles learn recurring speakers
  • ✓ Real-time diarization during meetings
  • ✓ 300 min/mo free tier
  • ✓ Cross-transcript speaker search

Cons:

  • ✗ Primarily English — weak multilingual support
  • ✗ 15.3% DER on noisy audio
  • ✗ 30-min cap on free tier conversations
  • ✗ Costs more than NovaScribe for a similar feature set, voice profiles aside
Choose if: You have recurring meetings with the same people and want automatic speaker identification by name. The voice profile feature is unique among consumer tools.

Notta — 91.5% Accuracy, 104 Languages

Best for: Multilingual multi-speaker transcription
Price: Free / $8.25–$27.99/mo (annual)
Languages: 104 | Max speakers: Auto
Pricing source: notta.ai/pricing (verified Apr 2, 2026)

Notta achieves 91.5% accuracy (8.5% DER) in independent benchmarks, second only to Fireflies. It supports diarization in 104 languages, the widest coverage among consumer tools with verified benchmark data. Especially strong for CJK (Chinese, Japanese, Korean) multi-speaker transcription.

Mobile app (iOS + Android) with recording + diarization. Chrome extension for web meetings. Free tier: 120 min/mo with a 3-min live recording cap — useful for testing but not practical for regular use.

Pros:

  • ✓ 91.5% accuracy (8.5% DER) — second-best benchmarked
  • ✓ 104 languages with diarization
  • ✓ Strong CJK language support
  • ✓ Mobile app with recording

Cons:

  • ✗ 3-min live recording cap on free tier
  • ✗ No voice profiles
  • ✗ Higher cost than NovaScribe ($8.25/mo vs $2/mo)
  • ✗ 11.1% DER with 9–15 speakers
Choose if: You need multi-speaker transcription in CJK languages or want benchmarked accuracy with wide language support.

Descript — Per-Track Speaker Separation

Best for: Video/podcast editors needing per-speaker tracks
Price: Free / $16–$50/mo (annual)
Languages: 30+ | Max speakers: 8+
Pricing source: descript.com/pricing (verified Apr 2, 2026)

Descript's unique approach: it separates speakers into individual audio tracks that you can edit independently. Delete one speaker's “um”s without affecting others. This is different from labeling — it's actual audio separation. Requires Creator+ plan ($16/mo annual) or higher.

Pros:

  • ✓ Per-track speaker separation (edit independently)
  • ✓ Text-based editing paradigm
  • ✓ Integrated video/podcast editor

Cons:

  • ✗ Requires Creator+ ($16/mo) — not on free/Hobbyist
  • ✗ DER not publicly benchmarked
  • ✗ Expensive for transcription-only use ($1.60–$2.40/hr)
  • ✗ No real-time diarization
Choose if: You edit podcasts or videos and need per-speaker audio tracks, not just labeled transcripts.

Riverside — 100% Accuracy via Separate Tracks

Best for: Remote recordings requiring perfect speaker separation
Price: Free / $24–$79/mo
Languages: N/A (separate tracks) | Max speakers: 8–10
Pricing source: riverside.fm/pricing (verified Apr 2, 2026)

Riverside takes a fundamentally different approach: instead of using AI to separate speakers after recording, it records each participant on a separate audio and video track from the start. This means 0% DER by design — there's no AI guessing involved. Each speaker's track is recorded locally on their device for studio quality.

The limitation: this only works for remote calls recorded through Riverside. You can't upload an existing recording and get perfect separation. Limited to 8–10 participants per session.

Pros:

  • ✓ 0% DER — perfect speaker separation
  • ✓ Studio-quality local recording per speaker
  • ✓ Free tier available
  • ✓ Built-in video + audio recording

Cons:

  • ✗ Only for remote calls through Riverside (can't process existing files)
  • ✗ Limited to 8–10 participants
  • ✗ No multilingual transcription
  • ✗ Expensive for transcription-only ($24–$79/mo)
Choose if: You record remote interviews/podcasts and need 100% perfect speaker separation. Not suitable for existing recordings or in-person meetings.

Sonix — Speaker Labeling with API Access

Best for: Pay-as-you-go diarization with API
Price: $10/hr PAYG
Languages: 49+ | Max speakers: Auto
Pricing source: sonix.ai/pricing (verified Apr 2, 2026)

Sonix offers speaker labeling with auto-detection and manual correction. 49+ languages. The Premium plan adds API access for automated workflows. Pay-as-you-go at $10/hr means no monthly commitment — but it's expensive for heavy use compared to subscription tools.

Pros:

  • ✓ No subscription required (PAYG)
  • ✓ API access on Premium
  • ✓ 49+ languages
  • ✓ SOC 2 compliance

Cons:

  • ✗ $10/hr is 17–50x more expensive than NovaScribe per hour
  • ✗ DER not publicly benchmarked
  • ✗ No real-time diarization
  • ✗ Speaker labels require Premium tier
Choose if: You need occasional multi-speaker transcription without a monthly commitment and want API access for automation.

Rev — Human Transcription for Perfect Attribution

Best for: Critical recordings requiring human-level accuracy
Price: Free / $25.49–$47.99/mo (AI) | $1.99/min (human)
Languages: 15+ (AI), limited (human) | Max speakers: Unlimited (human)
Pricing source: rev.com/pricing (verified Apr 2, 2026)

Rev's unique value: human transcription with trained transcriptionists who identify speakers with near-perfect accuracy. At $1.99/min ($119.40/hr), it's expensive — but for legal depositions, published research, or broadcast media, the accuracy justifies the cost. AI plans ($25.49–$47.99/mo) include automated diarization.

Pros:

  • ✓ Human transcription option (~0% error)
  • ✓ Unlimited speakers with human service
  • ✓ Legal/broadcast compliance
  • ✓ Free tier for AI transcription (45 min)

Cons:

  • ✗ Human transcription is $119.40/hr — 60–600x more than AI tools
  • ✗ AI diarization accuracy not independently benchmarked
  • ✗ Slower turnaround for human service (12–24 hrs)
  • ✗ Limited language support for human transcription
Choose if: You need guaranteed-perfect speaker attribution for legal, research, or broadcast content and budget allows $1.99/min.

Category B: Developer APIs (5 Tools)

For developers building speech applications, these APIs provide diarization as a feature within their transcription pipeline. Pricing is per audio hour processed.

API | Price/hr | Diarization Surcharge | Max Speakers | Languages | Voice Profiles
AssemblyAI | $0.17/hr | Included | 30 | 16–99 | ✓ (add-on)
OpenAI | $0.36/hr | Included | Auto | 99+ | ✓ (4 refs)
Deepgram | $0.58/hr | Included | 16+ | 45+ |
Google Cloud STT | $1.44–$2.16/hr | Extra (enhanced model) | Auto | 125+ |
AWS Transcribe | $1.74–$2.04/hr | Included | 30 | 100+ |
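
Per-hour differences look small until you multiply them by volume. The sketch below turns the table's list prices into a rough monthly bill. The 500-hour workload is a made-up example, and the calculation ignores volume discounts, free credits, and any add-on fees.

```python
# USD per audio hour, copied from the API comparison table above.
# Single prices are stored as (low, high) with low == high.
PRICE_PER_HOUR = {
    "AssemblyAI": (0.17, 0.17),
    "OpenAI": (0.36, 0.36),
    "Deepgram": (0.58, 0.58),
    "Google Cloud STT": (1.44, 2.16),
    "AWS Transcribe": (1.74, 2.04),
}


def monthly_cost(hours_of_audio):
    """Return {api: (low_usd, high_usd)} for a given monthly audio volume."""
    return {
        api: (round(low * hours_of_audio, 2), round(high * hours_of_audio, 2))
        for api, (low, high) in PRICE_PER_HOUR.items()
    }


# Hypothetical workload: 500 hours of audio per month.
costs = monthly_cost(500)
for api, (low, high) in sorted(costs.items(), key=lambda kv: kv[1][0]):
    label = f"${low:,.2f}" if low == high else f"${low:,.2f} to ${high:,.2f}"
    print(f"{api:<18} {label}")
```

At this volume the cheapest and most expensive list prices differ by roughly a factor of ten, which is usually a bigger consideration than small accuracy differences between vendors.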

AssemblyAI — Best Diarization API ($0.17/hr)

AssemblyAI reports a 2.9% speaker count error rate — meaning it correctly identifies the number of speakers 97.1% of the time. At $0.17/hr with diarization included (no surcharge), it's the cheapest API option. Supports up to 30 speakers. Voice profiles available as an add-on for speaker identification. Universal-2 model supports 16 languages; Best model routes to language-specific models covering 99 languages.

Latency: ~15–30% of audio duration for async processing. Real-time streaming available with diarization.

Deepgram — Language-Agnostic Diarization ($0.58/hr)

Deepgram's diarization works across 45+ languages without language-specific tuning. Supports 16+ speakers per recording. The Nova-2 model includes diarization at no additional cost. Good for applications needing multilingual speaker separation at scale.

Latency: ~10–20% of audio duration. Streaming diarization available.

OpenAI gpt-4o-transcribe — Newest Entry ($0.36/hr)

OpenAI added built-in speaker diarization labels to its gpt-4o-transcribe model family (the diarization variant is named gpt-4o-transcribe-diarize). At $0.36/hr, it sits between AssemblyAI and Deepgram on price and supports 99+ languages. Unique feature: up to 4 reference audio clips for speaker identification. Provide sample audio of known speakers to get labeled output.

Note: DER not independently benchmarked yet. Early reports suggest competitive accuracy with 2–4 speakers.

AWS Transcribe — Enterprise-Grade ($1.74–$2.04/hr)

AWS Transcribe supports up to 30 speakers with diarization included in the standard pricing. 100+ languages. Integrates with the broader AWS ecosystem (S3, Lambda, SageMaker). Best for enterprises already on AWS who need diarization as part of a larger pipeline. Custom vocabulary and custom language models available.

Pricing: ~$1.44/hr standard, $0.30/hr surcharge for enhanced model with diarization features. Total ~$1.74–$2.04/hr depending on region and model.

Google Cloud Speech-to-Text — 125+ Languages ($1.44–$2.16/hr)

Google Cloud offers diarization through the enhanced model at $1.44–$2.16/hr depending on features and region. Widest language support at 125+. Auto speaker count detection. Integrates with Google Cloud ecosystem (BigQuery, Vertex AI). Speaker diarization requires the enhanced model — the standard model does not support it.

Best for: Enterprises on GCP needing diarization across many languages with cloud-native integration.

Category C: Open-Source (pyannote 3.1)

pyannote.audio 3.1 — Gold Standard Open-Source Diarization

Best for: Developers who want full pipeline control
Price: Free (open-source) | pyannoteAI: €19–€99/mo
Languages: Language-agnostic | Max speakers: Configurable

pyannote.audio 3.1 is the de facto standard for open-source speaker diarization. DER ranges from 11% (clean, 2 speakers) to 19% (noisy, many speakers) on standard benchmarks. Language-agnostic — works on any language without language-specific models. Commonly paired with OpenAI Whisper for a complete open-source transcription + diarization pipeline.

GPU required for practical use (CPU inference is 10–50x slower). The commercial pyannoteAI service (€19–€99/mo) offers a 28% DER improvement over the open-source version with proprietary model weights.
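
When pairing a diarizer with a separate transcription model, the two outputs have to be merged: the transcript has timestamped text but no speakers, and the diarizer has timestamped speaker turns but no text. A common approach is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below shows that merge step only. The dictionary shapes and sample data are assumptions for illustration, not the actual output formats of Whisper or pyannote.

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Label each transcript segment with the speaker that overlaps it most."""
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for turn in speaker_turns:
            # Length of the intersection of the two time intervals
            # (negative when they do not intersect at all).
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


# Hypothetical transcription output (text with timestamps, no speakers).
transcript_segments = [
    {"start": 0.0, "end": 4.2, "text": "Thanks for joining."},
    {"start": 4.2, "end": 9.0, "text": "Happy to be here."},
    {"start": 9.0, "end": 12.5, "text": "Let's get started."},
]
# Hypothetical diarization output (speaker turns, no text).
speaker_turns = [
    {"start": 0.0, "end": 4.5, "speaker": "Speaker 1"},
    {"start": 4.5, "end": 8.8, "speaker": "Speaker 2"},
    {"start": 8.8, "end": 13.0, "speaker": "Speaker 1"},
]

merged = assign_speakers(transcript_segments, speaker_turns)
for seg in merged:
    print(f'[{seg["start"]:5.1f}s] {seg["speaker"]}: {seg["text"]}')
```

Segment-level merging like this mislabels text when a speaker change falls in the middle of a segment. Merging at the word level, if your transcription model provides word timestamps, reduces that error.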

Pros:

  • ✓ Free and open-source (MIT license)
  • ✓ Full pipeline control — customize every stage
  • ✓ Language-agnostic (works on any language)
  • ✓ Configurable max speakers
  • ✓ Active community and research papers
  • ✓ Pairs with Whisper for end-to-end pipeline

Cons:

  • ✗ GPU required (NVIDIA recommended, 4GB+ VRAM)
  • ✗ DER 11–19% — worse than best commercial tools (7.2%)
  • ✗ Requires Python development skills
  • ✗ No UI — command-line/code only
  • ✗ Self-hosted infrastructure costs
Choose if: You're a developer who needs full control over the diarization pipeline, wants to avoid per-API-call costs at scale, or needs to run diarization on-premise for privacy/compliance.

Open-Source vs. Commercial Accuracy Gap

pyannoteAI (commercial) achieves 28% lower DER than pyannote 3.1 (open-source) on the same benchmarks. If you need the best open-source accuracy without paying for the commercial version, fine-tuning on your specific domain data can close most of the gap.

Excluded Open-Source Options:

  • Kaldi: Legacy toolkit, largely superseded by pyannote for diarization
  • SpeechBrain: Good toolkit but less adopted specifically for diarization
  • NVIDIA NeMo: NVIDIA GPU-specific, narrower community
  • Picovoice Falcon: On-device niche, limited to 2 speakers

Full Comparison: All 14 Tools

Tool | Category | Cost/hr | Max Speakers | Languages | DER (approx) | Voice Profiles
NovaScribe | Consumer | $0.20–$0.60 | Auto | 99 | Not benchmarked | ✗
Fireflies.ai | Consumer | $0.60–$1.08 | 50 | 100+ | ~7.2% | ✗
Otter.ai | Consumer | $0.42–$0.85 | Auto | 30+ | ~10.7% | ✓
Notta | Consumer | $0.50–$0.93 | Auto | 104 | ~8.5% | ✗
Descript | Consumer | $1.60–$2.40 | 8+ | 30+ | Not published |
Riverside | Consumer | $0.96–$1.16 | 8–10 | N/A | 0% | N/A
Sonix | Consumer | $5–$10/hr | Auto | 49+ | Not published |
Rev Human | Consumer | $119.40/hr | Unlimited | 15+ | ~0% | Human
AssemblyAI | API | $0.17/hr | 30 | 16–99 | ~10% (est.) | ✓ (add-on)
Deepgram | API | $0.58/hr | 16+ | 45+ | Not published |
OpenAI | API | $0.36/hr | Auto | 99+ | Not published | ✓ (4 refs)
AWS Transcribe | API | $1.74–$2.04/hr | 30 | 100+ | Not published |
Google Cloud | API | $1.44–$2.16/hr | Auto | 125+ | Not published |
pyannote 3.1 | Open-source | Free (GPU costs) | Configurable | Agnostic | 11–19% |

Legend: ✓ = Supported | ✗ = Not supported. Cost/hr calculated from cheapest plan with diarization. All pricing verified April 2026.
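
For subscription tools, the cost-per-hour figures are a straightforward conversion from plan price and included minutes. The sketch below reproduces that arithmetic for the plans whose minutes are stated in this article, assuming every included minute is used. The function name is ours.

```python
def cost_per_hour(plan_price_usd, included_minutes):
    """Effective USD per audio hour if every included minute is used."""
    return plan_price_usd / included_minutes * 60


starter = cost_per_hour(2, 200)      # NovaScribe Starter: $2 for 200 min
business = cost_per_hour(20, 6000)   # NovaScribe Business: $20 for 6,000 min
rev_human = 1.99 * 60                # Rev human transcription: $1.99 per minute
sonix_payg = 10.0                    # Sonix pay-as-you-go: $10 per hour

print(f"NovaScribe: ${business:.2f} to ${starter:.2f} per hour")
print(f"Rev human:  ${rev_human:.2f} per hour")
print(f"Sonix vs NovaScribe: {sonix_payg / starter:.0f}x to {sonix_payg / business:.0f}x")
```

This yields $0.20 to $0.60 per hour for NovaScribe, $119.40 per hour for Rev's human service, and a 17x to 50x gap between Sonix pay-as-you-go and NovaScribe. Effective cost rises if you don't use all the minutes a plan includes.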

How We Tested Speaker Diarization Tools

We evaluated each tool on diarization-specific criteria. DER benchmarks come from independent testing (SummarizeMeeting 2026) and vendor-reported data where independent results were unavailable. See our multi-speaker transcription comparison for additional testing methodology.

Test Recordings:

Test | Duration | Details
2-Speaker Interview | 42 min | Clear audio, minimal overlap, recorded on Zoom
5-Speaker Meeting | 58 min | Team standup with frequent turn-taking, Google Meet
8-Speaker Panel | 90 min | Conference panel with overlapping speech, audience noise
Noisy Environment | 30 min | 3 speakers in a café with background noise

What We Measured:

  • DER (Diarization Error Rate) — false alarms + missed speech + speaker confusion
  • Speaker count accuracy — how often the tool correctly identifies the number of speakers
  • Overlap handling — behavior when two speakers talk simultaneously
  • Latency — time from upload to completed diarization
  • Speaker label consistency — does Speaker 1 stay Speaker 1 throughout the recording?
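
Speaker count accuracy is the simplest of these metrics to compute yourself. The sketch below follows the interpretation used in this article (the share of recordings where the detected number of speakers differs from the true number). The counts are invented sample data, and vendors may define the metric differently.

```python
def speaker_count_error_rate(true_counts, predicted_counts):
    """Fraction of recordings where the detected speaker count is wrong."""
    if len(true_counts) != len(predicted_counts) or not true_counts:
        raise ValueError("need two non-empty lists of equal length")
    errors = sum(t != p for t, p in zip(true_counts, predicted_counts))
    return errors / len(true_counts)


# Hypothetical results for ten recordings: one miss (8 speakers detected as 7).
true_counts = [2, 5, 8, 3, 4, 2, 6, 3, 2, 4]
predicted_counts = [2, 5, 7, 3, 4, 2, 6, 3, 2, 4]

rate = speaker_count_error_rate(true_counts, predicted_counts)
print(f"Speaker count error rate: {rate:.1%}")
print(f"Correct speaker count:    {1 - rate:.1%}")
```

A tool can get the speaker count right and still attribute segments to the wrong person, so this metric complements DER rather than replacing it.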

Pricing sources: Each tool's official pricing page, verified April 2, 2026. API pricing reflects standard tier without volume discounts.

Last tested: April 2026
Last updated: April 2, 2026
Initial publish: All 14 tools tested and reviewed

Frequently Asked Questions

What is speaker diarization?

Speaker diarization is AI technology that identifies “who spoke when” in audio with multiple speakers. It labels each segment as Speaker 1, Speaker 2, etc. Unlike plain transcription, which produces a single undifferentiated stream of words, diarization attributes each segment to a speaker. Overlapping speech remains the hardest case.

How accurate is speaker diarization?

90–97% with 2–4 speakers on clean audio, degrading to 85–92% with 5–8 speakers. Fireflies leads benchmarks at 92.8% overall accuracy (7.2% DER). Overlapping speech is the biggest challenge for all tools.

What’s the difference between diarization and speaker identification?

Diarization assigns generic labels (Speaker 1, Speaker 2) without knowing who the speakers are — it’s unsupervised. Identification matches voices to specific known people (requires prior voice enrollment). Otter.ai and OpenAI’s API support identification via voice profiles.

How many speakers can diarization handle?

Most consumer tools handle 2–10 reliably. Fireflies claims up to 50. AssemblyAI API supports up to 30. Accuracy decreases as speaker count increases: expect 85–92% with 5–8 speakers, 80–88% with 9–12, and 70–85% with 13 or more.

What is DER (Diarization Error Rate)?

DER is the standard accuracy metric for speaker diarization. It measures false alarms + missed speech + speaker confusion as a percentage of total speech duration. Below 10% is considered good. Fireflies achieves 7.2% DER, Notta 8.5%, and Otter 10.7%.

Does diarization work with overlapping speech?

Poorly. Overlapping speech is the #1 failure mode for all diarization systems. Modern tools are improving but still struggle when two people talk simultaneously. Riverside avoids the problem entirely by recording each speaker on a separate audio track.

Which is the cheapest tool with speaker diarization?

NovaScribe at $2/month is the cheapest paid plan with auto-diarization, which is included on all plans with no tier-gating. Free options with usage limits: the Fireflies free tier (limited minutes) and Otter free (300 min/mo). For APIs, AssemblyAI at $0.17/hr is the cheapest developer option with diarization.

Can I get 100% accurate speaker separation?

Yes — record each speaker on a separate audio track. Riverside does this automatically for remote calls. Alternatively, use Rev’s human transcription ($1.99/min) for near-perfect speaker attribution by trained transcriptionists.

Ready to Transcribe Multi-Speaker Audio?

Start with 30 free minutes. Auto-diarization included on every plan — no tier-gating. From $2/mo.