STT Showdown 2026: ElevenLabs vs Whisper vs Deepgram

ElevenLabs Scribe v2, OpenAI Whisper (GPT-4o), and Deepgram Nova-3 compared! Based on June 2026 independent benchmarks and official USD pricing, we analyze accuracy, cost, multilingual performance, and speaker diarization. Discover the best speech-to-text AI for your specific use cases.

🎯 What you’ll learn in this post

• The real accuracy rankings of the top 3 STT providers as of June 2026, based on independent benchmarks (Artificial Analysis)
• Cost-per-hour comparison — identifying the most budget-friendly API for your workload
• The honest reality of Korean language performance (and when to consider specialized local engines)
• Crucial differences in diarization and word-level timestamps — the deciding factors for your subtitle and meeting minutes workflow
• Final recommendations by use case: Subtitle creation, real-time streaming, and self-hosting

 

📌 Introduction

Hello from the ElevenLabs Lab!

While many associate ElevenLabs primarily with TTS, the official launch of Scribe v2 (batch transcription) in January 2026 has firmly established us as a leading competitor to OpenAI Whisper and Deepgram in the speech-to-text (STT) market.
(Note: Scribe v2 Realtime for live streaming was released in November 2025 — as per our official blog announcement.)

 

In our previous intro to Scribe, we covered the basics. Today, we’re answering the burning question: "Which service should I actually use?" We’ve broken this down using independent benchmarks and current official pricing. Any data points provided by the vendors themselves are clearly marked as "internal benchmarks."

 

⚡ The 3-point takeaway for the busy reader

1️⃣ Batch transcription for subtitles, meeting notes, or podcasts → Scribe v2 (Top-tier accuracy in independent benchmarks + cost-effective at $0.22/hour)
2️⃣ Large-scale real-time streaming or call centers → Deepgram Nova-3 (Superior processing speed, streaming-optimized pricing, and high concurrency)
3️⃣ Zero cost and data sovereignty are your priorities → Self-hosted Whisper (MIT licensed, though you will need to implement your own diarization)

 

 

📖 Before we dive in — a quick glossary ⚡

STT = Speech-to-Text. AI that transcribes spoken language into text (the engine behind meeting minutes and video captions).
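WER = Word Error Rate. The number of wrong words (substituted, dropped, or inserted) divided by the number of words actually spoken. A 5% WER means roughly 5 errors per 100 spoken words. The lower, the better.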
Diarization = The ability to identify individual speakers (e.g., "Speaker A said this, Speaker B said that"). Critical for meeting transcripts.
Word-level Timestamps = Attaching a timestamp (MM:SS) to every individual word — essential for frame-perfect subtitle synchronization.
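
To make the WER definition concrete, here is a minimal sketch of how the metric is typically computed: a word-level edit distance divided by the reference length. The function name and sample sentences are our own illustration. Real benchmarks usually also normalize punctuation and numerals first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown box jumps over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 22.2%
```

In this example, one substituted word ("box") plus one dropped word ("the") gives 2 errors over 9 reference words.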

 

📊 1. Accuracy — What the independent benchmarks say

The most common trap in STT comparison is relying solely on vendor-provided data. Every company claims to be #1. That’s why we looked at the Artificial Analysis AA-WER index (June 2026; lower is more accurate).

 

| Model | AA-WER (lower is better) | Speed Factor |
|---|---|---|
| ElevenLabs Scribe v2 | 2.2% (Overall 2nd) | 34.0x |
| OpenAI gpt-4o-transcribe | 4.0% | (not listed) |
| OpenAI gpt-4o-mini-transcribe | 4.5% | (not listed) |
| Deepgram Nova-3 | 5.2% | 504.4x (Dominant #1) |

▲ Source: Artificial Analysis Speech-to-Text Leaderboard (Verified June 2026)

 

In summary: Scribe v2 beats both OpenAI models and Deepgram on accuracy, while Deepgram leads by a wide margin on processing speed at 504x. Think of it this way: Deepgram can process an hour of audio in about 7 seconds, while Scribe v2 takes a little under two minutes.
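
If you want to check that arithmetic yourself, here is a quick sketch. It assumes the speed factor means "seconds of audio processed per second of wall-clock time", which is our reading of the leaderboard column rather than a quoted definition. The function name is ours.

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimated wall-clock time, assuming speed_factor = audio time / processing time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


# Speed factors taken from the benchmark table above.
SPEED_FACTORS = {
    "ElevenLabs Scribe v2": 34.0,
    "Deepgram Nova-3": 504.4,
}

ONE_HOUR = 3600
for model, factor in SPEED_FACTORS.items():
    seconds = processing_seconds(ONE_HOUR, factor)
    print(f"{model}: {seconds:.1f} s per hour of audio")
```

For batch jobs such as subtitles, this gap rarely matters. It becomes decisive when you process thousands of hours a day or need near-instant turnaround.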

For context, ElevenLabs claims Scribe v2 Realtime has a 93.5% average accuracy across 30 European/Asian languages—but as this is an internal benchmark, please treat it as a general indicator rather than absolute fact.

 

💰 2. Pricing — Comparing by the hour

| Category | Scribe v2 | OpenAI | Deepgram Nova-3 |
|---|---|---|---|
| Batch Transcription | $0.22/hour | $0.36/hour (gpt-4o-transcribe), $0.18/hour (mini) | $0.46/hour (single language) |
| Real-time Streaming | $0.39/hour | Realtime API token usage (varies) | $0.29/hour |
| Diarization | Included by default | diarize model $0.36/hour | Included by default |
| Free Trial | Free plan: 10,000 credits/mo | No free API tier | $200 credit (No card required) |

▲ Source: elevenlabs.io/pricing/api · developers.openai.com pricing · deepgram.com/pricing (Verified June 2026)

 

Key takeaways:

  • Batch transcription value goes to Scribe v2 — At $0.22/hour, it is significantly more affordable than gpt-4o-transcribe ($0.36) while delivering higher accuracy. Our Creator plan ($22/mo) includes 100 hours of batch transcription.

  • Streaming cost goes to Deepgram — $0.29/hour with per-second billing, supporting up to 150 concurrent WebSocket connections even on PAYG plans.

  • Deepgram’s $200 free credit is the most generous, covering roughly 433 hours for Nova-3. It’s an ideal way to prototype without initial investment.
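
To plug in your own numbers, here is a small sketch of the cost math behind these takeaways. The hourly rates are the batch prices from the table above. The function names and the 100-hour workload are illustrative assumptions.

```python
# Batch prices in USD per audio hour, copied from the pricing table above.
BATCH_USD_PER_HOUR = {
    "Scribe v2": 0.22,
    "gpt-4o-transcribe": 0.36,
    "gpt-4o-mini-transcribe": 0.18,
    "Deepgram Nova-3": 0.46,
}


def monthly_cost(audio_hours: float, usd_per_hour: float) -> float:
    """Cost of transcribing a given number of audio hours at a flat hourly rate."""
    return audio_hours * usd_per_hour


def hours_covered(credit_usd: float, usd_per_hour: float) -> float:
    """How many audio hours a fixed credit buys at a flat hourly rate."""
    return credit_usd / usd_per_hour


workload_hours = 100  # hypothetical monthly workload
for model, rate in sorted(BATCH_USD_PER_HOUR.items(), key=lambda item: item[1]):
    print(f"{model}: ${monthly_cost(workload_hours, rate):.2f} for {workload_hours} h")

free_hours = hours_covered(200, BATCH_USD_PER_HOUR["Deepgram Nova-3"])
print(f"Deepgram $200 credit: about {free_hours:.0f} h of Nova-3 batch audio")
```

With the rounded $0.46/hour rate, the credit works out to about 435 hours, close to the roughly 433 hours quoted above. The small difference presumably comes from rounding in the hourly price.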

 

🇰🇷 3. Korean language performance — An honest assessment

This is perhaps the most important section for anyone whose audio isn't English-only. To be blunt: no global STT API performs in Korean as flawlessly as it does in English.

 

In ElevenLabs' official language tier list for Scribe, Korean sits in the 'Good' tier (10–20% WER), below the top 'Excellent' tier (WER under 5%), which covers 36 languages including English, German, French, Spanish, and Japanese. The good news for readers of this English edition: English is squarely in that Excellent tier.

For Korean specifically, the most frequently cited number is ElevenLabs' own internal benchmark: Scribe v1 scored 10.7% WER on the Korean portion of the FLEURS dataset. Keep the caveats in mind — it's a vendor-published figure, it covers Scribe v1 rather than v2, and it applies to Korean only, so don't project it onto English or any other language.

 

For a more nuanced perspective: independent Korean-language benchmarks (measured in Character Error Rate) put specialized local engines such as Return Zero and Naver Clova at roughly 5.9–7.5% CER, often ahead of global models like Whisper. (Note: that's a Korean-only story, and those comparisons did not include Scribe or Nova-3.)
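
Why CER rather than WER for Korean? Spacing is flexible in Korean, so a transcript can be "wrong" at the word level purely because of where the spaces fall. The sketch below shows one common way to compute CER: strip whitespace and compare character by character. The function name, the whitespace-stripping convention, and the sample sentences are our own assumptions. The benchmarks cited above may normalize text differently.

```python
import unicodedata


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / reference character count."""
    # Normalize to NFC so each Hangul syllable is a single code point,
    # then drop all whitespace so spacing differences are not penalized.
    ref = "".join(unicodedata.normalize("NFC", reference).split())
    hyp = "".join(unicodedata.normalize("NFC", hypothesis).split())
    if not ref:
        raise ValueError("reference transcript must not be empty")

    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_char in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "오늘 회의는 세 시에 시작합니다"
hypothesis = "오늘 회의는 네 시에 시작 합니다"  # one wrong syllable, one extra space
print(f"CER: {char_error_rate(reference, hypothesis):.1%}")  # CER: 7.7%
```

Here the extra space costs nothing and the single wrong syllable counts as 1 error over 13 characters. A word-level metric would also penalize the spacing difference.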

If your primary workload is high-volume Korean transcription, we highly recommend testing those specialized local APIs alongside the global players. Conversely, if you're working on English-centric or multilingual content (global YouTube channels, dubbing workflows), the global big three remain your best bet. We'd rather be transparent than push a tool that isn't the right fit for your language mix. 😅

 

🧰 4. Feature differences — What matters for subtitles & transcripts

  • Diarization: Built into Scribe v2 and Deepgram. Open-source Whisper lacks it natively and needs a third-party library such as `pyannote`. OpenAI offers it through a separate model, `gpt-4o-transcribe-diarize`, priced at the same $0.36/hour as gpt-4o-transcribe, with pre-registration of up to 4 speakers.

  • Word-level Timestamps: Supported natively by Scribe v2. OpenAI only supports this in the legacy whisper-1 model, not the modern gpt-4o-transcribe series—a common pitfall for those building automated subtitle workflows.

  • Terminology calibration: Deepgram’s "Keyterm Prompting" supports Korean, which is invaluable for transcripts heavy on technical jargon or brand names.
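
To show why word-level timestamps matter for subtitles, here is a minimal sketch that turns timed words into SRT cues. The `Word` structure is hypothetical and is not any vendor's response schema. You would map each provider's JSON into it yourself. The 42-character limit and the 0.8-second pause threshold are arbitrary illustrative defaults.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical container for one transcribed word with timings in seconds."""
    text: str
    start: float
    end: float


def srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_chars: int = 42, max_gap: float = 0.8) -> str:
    """Group words into cues, breaking on long lines or long pauses."""
    cues: list[list[Word]] = []
    current: list[Word] = []
    for word in words:
        if current:
            length = len(" ".join(w.text for w in current)) + 1 + len(word.text)
            pause = word.start - current[-1].end
            if length > max_chars or pause > max_gap:
                cues.append(current)
                current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        blocks.append(f"{index}\n{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}\n{text}")
    return "\n\n".join(blocks) + "\n"


words = [
    Word("Welcome", 0.0, 0.4), Word("back", 0.45, 0.7), Word("everyone.", 0.75, 1.3),
    Word("Let's", 2.5, 2.8), Word("begin.", 2.85, 3.3),
]
print(words_to_srt(words))
```

Without per-word timings you only get segment-level start and end times, so cue boundaries like the pause-based split above have to be guessed.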

 

Test with the Scribe v2 Free Plan →

 

🎯 5. Final recommendations by use case

| Use Case | Recommendation | Reasoning |
|---|---|---|
| YouTube subs, Podcasts, Minutes | Scribe v2 | Top-tier accuracy + built-in diarization/timestamps + competitive pricing |
| Live call centers, Live subs | Deepgram Nova-3 | 504x speed, cost-effective streaming, high concurrency |
| Zero budget, On-prem, Data sovereignty | Self-hosted Whisper | MIT licensed. Requires custom diarization and self-managed GPU infra |
| High-volume Korean-only transcription | Compare local specialized APIs | Regional engines often maintain an edge in CER benchmarks |

 

⚠️ 6. Weaknesses to consider

  • Scribe v2: Processing speed (34x) is slower than Deepgram. Credit-to-hour consumption varies, so we recommend monitoring your dashboard as you scale.

  • OpenAI: No free API tier; the newest models lack word-level timestamps; real-time pricing via tokens makes forecasting costs difficult.

  • Deepgram: Lower accuracy in independent benchmarks (5.2% AA-WER vs 2.2% for Scribe v2); Korean support arrived relatively late (added to Nova-3 in 2026).

  • Whisper (Open-source): No major updates since large-v3-turbo (Oct 2024); you are solely responsible for diarization and infrastructure maintenance.

 

🚀 Final thoughts — The value of A/B testing

Benchmarks are merely references. Performance with your specific audio—considering recording quality, accents, and unique jargon—must be verified through real-world testing. With the ElevenLabs Free plan and Deepgram’s $200 credit, you can perform head-to-head comparisons using your own files at zero cost.

If you're new to integrating APIs, check out our Voice AI API integration guide, or for TTS comparisons, refer to our ElevenLabs vs Google TTS vs Amazon Polly comparison.

 

Get started with ElevenLabs Scribe (Free) →

 

ElevenLabs Lab signing off. ⚡