🎯 What you’ll learn in this post
• The real accuracy rankings of the top 3 STT providers as of June 2026, based on independent benchmarks (Artificial Analysis)
• Cost-per-hour comparison — identifying the most budget-friendly API for your workload
• The honest reality of Korean language performance (and when to consider specialized local engines)
• Crucial differences in diarization and word-level timestamps — the deciding factors for your subtitle and meeting minutes workflow
• Final recommendations by use case: Subtitle creation, real-time streaming, and self-hosting
📌 Introduction
Hello from the ElevenLabs Lab!
While many associate ElevenLabs primarily with TTS, the official launch of Scribe v2 (batch transcription) in January 2026 has firmly established us as a leading competitor to OpenAI Whisper and Deepgram in the speech-to-text (STT) market.
(Note: Scribe v2 Realtime for live streaming was released in November 2025 — as per our official blog announcement.)
In our previous intro to Scribe, we covered the basics. Today, we’re answering the burning question: "Which service should I actually use?" We’ve broken this down using independent benchmarks and current official pricing. Any data points provided by the vendors themselves are clearly marked as "internal benchmarks."
⚡ The 3-point takeaway for the busy reader
1️⃣ Batch transcription for subtitles, meeting notes, or podcasts → Scribe v2 (Top-tier accuracy in independent benchmarks + cost-effective at $0.22/hour)
2️⃣ Large-scale real-time streaming or call centers → Deepgram Nova-3 (Superior processing speed, streaming-optimized pricing, and high concurrency)
3️⃣ Zero API cost & data sovereignty are your priorities → Self-hosted Whisper (MIT licensed, though you will need to implement your own diarization and run your own GPUs)
📖 Before we dive in — a quick glossary ⚡
• STT = Speech-to-Text. AI that transcribes spoken language into text (the engine behind meeting minutes and video captions).
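• WER = Word Error Rate. The share of words the engine gets wrong (substituted, dropped, or inserted) relative to a reference transcript, so a 5% WER means roughly 5 errors per 100 words. The lower, the better.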
• Diarization = The ability to identify individual speakers (e.g., "Speaker A said this, Speaker B said that"). Critical for meeting transcripts.
• Word-level Timestamps = Attaching a start and end time to every individual word, which is essential for tightly synchronized subtitles.
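To make the WER definition concrete, here is a minimal sketch that computes it as a word-level edit distance divided by the number of reference words. This is the generic textbook formulation, not the exact AA-WER methodology used by Artificial Analysis (which we assume also normalizes casing and punctuation); the function name and sample sentences are ours, for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
# One substitution (jumps -> jumped) and one deletion (the) over 9 reference words.
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 22.2%
```

Because insertions count as errors too, WER can exceed 100% on very noisy output.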
📊 1. Accuracy — What the independent benchmarks say
The most common trap in STT comparison is relying solely on vendor-provided data. Every company claims to be #1. That’s why we looked at the Artificial Analysis AA-WER index (June 2026; lower is more accurate).
| Model | AA-WER (Lower is better) | Speed Factor |
|---|---|---|
| ElevenLabs Scribe v2 | 2.2% (Overall 2nd) | 34.0x |
| OpenAI gpt-4o-transcribe | 4.0% | — |
| OpenAI gpt-4o-mini-transcribe | 4.5% | — |
| Deepgram Nova-3 | 5.2% | 504.4x (Dominant #1) |
▲ Source: Artificial Analysis Speech-to-Text Leaderboard (Verified June 2026)
In summary: Scribe v2 outperforms OpenAI and Deepgram in accuracy, while Deepgram leads significantly in processing speed at 504x. Think of it this way: Deepgram can process an hour of audio in about 7 seconds, while Scribe v2 takes a little under two minutes.
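If you want to translate a speed factor into wall-clock time for your own files, the arithmetic is a one-liner. The sketch below assumes the speed factor means "seconds of audio transcribed per second of processing"; the helper name is ours, and real-world latency will also include upload and queueing time that this ignores.

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimated processing time, assuming speed_factor = audio time / processing time."""
    return audio_seconds / speed_factor


ONE_HOUR = 3600  # seconds of audio

# Speed factors taken from the benchmark table in this post.
speed_factors = {"Scribe v2": 34.0, "Deepgram Nova-3": 504.4}

for model, factor in speed_factors.items():
    secs = processing_seconds(ONE_HOUR, factor)
    print(f"{model}: {secs:.1f} s ({secs / 60:.2f} min) per hour of audio")
# Scribe v2: 105.9 s (1.76 min) per hour of audio
# Deepgram Nova-3: 7.1 s (0.12 min) per hour of audio
```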
For context, ElevenLabs claims Scribe v2 Realtime has a 93.5% average accuracy across 30 European/Asian languages—but as this is an internal benchmark, please treat it as a general indicator rather than absolute fact.
💰 2. Pricing — Comparing by the hour
| Category | Scribe v2 | OpenAI | Deepgram Nova-3 |
|---|---|---|---|
| Batch Transcription | $0.22/hour | $0.36/hour (gpt-4o-transcribe) | $0.46/hour (single language) |
| Real-time Streaming | $0.39/hour | Realtime API token usage (varies) | $0.29/hour |
| Diarization | Included by default | diarize model $0.36/hour | Included by default |
| Free Trial | Free plan: 10,000 credits/mo | No free API tier | $200 credit (No card required) |
▲ Source: elevenlabs.io/pricing/api · developers.openai.com pricing · deepgram.com/pricing (Verified June 2026)
Key takeaways:
Batch transcription value goes to Scribe v2 — At $0.22/hour, it is significantly more affordable than gpt-4o-transcribe ($0.36) while delivering higher accuracy. Our Creator plan ($22/mo) includes 100 hours of batch transcription.
Streaming cost goes to Deepgram — $0.29/hour with per-second billing, supporting up to 150 concurrent WebSocket connections even on PAYG plans.
Deepgram’s $200 free credit is the most generous, covering roughly 433 hours of Nova-3 batch transcription. It’s an ideal way to prototype without any upfront spend.
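To sanity-check these takeaways against your own volume, a few lines of arithmetic go a long way. The sketch below uses the batch rates from the pricing table in this post; the function names are ours, and it deliberately ignores plan bundles, volume discounts, and add-on fees, so treat the output as a rough estimate rather than a quote.

```python
# Batch rates in USD per audio hour, copied from the pricing table in this post.
BATCH_RATE_USD_PER_HOUR = {
    "Scribe v2": 0.22,
    "OpenAI gpt-4o-transcribe": 0.36,
    "Deepgram Nova-3": 0.46,
}


def monthly_cost(hours: float, rate_per_hour: float) -> float:
    """Pay-as-you-go cost for a month, rounded to cents."""
    return round(hours * rate_per_hour, 2)


def cheapest_provider(hours: float, rates: dict[str, float]) -> str:
    """Name of the provider with the lowest estimated cost for this workload."""
    return min(rates, key=lambda name: monthly_cost(hours, rates[name]))


hours_per_month = 100
for name, rate in BATCH_RATE_USD_PER_HOUR.items():
    print(f"{name}: ${monthly_cost(hours_per_month, rate):.2f} for {hours_per_month} h")
# Scribe v2: $22.00 for 100 h
# OpenAI gpt-4o-transcribe: $36.00 for 100 h
# Deepgram Nova-3: $46.00 for 100 h

print("Cheapest:", cheapest_provider(hours_per_month, BATCH_RATE_USD_PER_HOUR))
```

Swap in the streaming rates from the same table to run the comparison for live workloads.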
🇰🇷 3. Korean language performance — An honest assessment
This is perhaps the most important section for anyone whose audio isn't English-only. To be blunt: no global STT API performs in Korean as flawlessly as it does in English.
In ElevenLabs' official language tier list for Scribe, Korean sits in the 'Good' tier (10–20% WER), below the top 'Excellent' tier (WER under 5%), a category of 36 languages that includes English, German, French, Spanish, and Japanese. The good news for readers of this English edition: English is squarely in that Excellent tier. ⚡
For Korean specifically, the most frequently cited number is ElevenLabs' own internal benchmark: Scribe v1 scored 10.7% WER on the Korean portion of the FLEURS dataset. Keep the caveats in mind — it's a vendor-published figure, it covers Scribe v1 rather than v2, and it applies to Korean only, so don't project it onto English or any other language.
For a more nuanced perspective: independent Korean-language benchmarks (measured in Character Error Rate) put specialized local engines such as Return Zero and Naver Clova at roughly 5.9–7.5% CER, often ahead of global models like Whisper. (Note: that's a Korean-only story, and those comparisons did not include Scribe or Nova-3.)
If your primary workload is high-volume Korean transcription, we highly recommend testing those specialized local APIs alongside the global players. Conversely, if you're working on English-centric or multilingual content (global YouTube channels, dubbing workflows), the global big three remain your best bet. We'd rather be transparent than push a tool that isn't the right fit for your language mix. 😅
🧰 4. Feature differences — What matters for subtitles & transcripts
Diarization: Built into Scribe v2 and Deepgram. Open-source Whisper lacks this natively, requiring third-party libraries like `pyannote`. OpenAI’s `gpt-4o-transcribe-diarize` supports it at the same $0.36/hour rate as gpt-4o-transcribe and allows pre-registration of up to 4 speakers.
Word-level Timestamps: Supported natively by Scribe v2. OpenAI only supports this in the legacy whisper-1 model, not the modern gpt-4o-transcribe series—a common pitfall for those building automated subtitle workflows.
Terminology calibration: Deepgram’s "Keyterm Prompting" supports Korean, which is invaluable for transcripts heavy on technical jargon or brand names.
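To show why word-level timestamps matter for subtitles, here is a minimal sketch that turns a list of timed words into SRT cues. The input shape (dicts with `text`, `start`, and `end` in seconds) is an assumption for illustration and not any provider's documented response schema, so map your API's actual fields onto it first. The cue-splitting thresholds are arbitrary defaults.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7, max_gap: float = 0.8) -> str:
    """Group timed words into cues, splitting on long pauses or long lines."""
    cues, current = [], []
    for word in words:
        gap = word["start"] - current[-1]["end"] if current else 0.0
        if current and (len(current) >= max_words or gap > max_gap):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_srt_time(cue[0]["start"])
        end = format_srt_time(cue[-1]["end"])
        text = " ".join(w["text"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


# Hand-written sample data standing in for an STT response.
sample_words = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
    {"text": "Next", "start": 2.5, "end": 2.8},  # 1.6 s pause, so a new cue starts
]
print(words_to_srt(sample_words))
```

Without per-word times, you would have to guess cue boundaries from segment-level timestamps, which is where subtitle drift usually creeps in.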
Test with the Scribe v2 Free Plan →
🎯 5. Final recommendations by use case
| Use Case | Recommendation | Reasoning |
|---|---|---|
| YouTube subs, Podcasts, Minutes | Scribe v2 | Top-tier accuracy + built-in diarization/timestamps + competitive pricing |
| Live call centers, Live subs | Deepgram Nova-3 | 504x speed, cost-effective streaming, high concurrency |
| Zero budget, On-prem, Data sovereignty | Self-hosted Whisper | MIT licensed. Requires custom diarization and self-managed GPU infra |
| High-volume Korean-only transcription | Compare local specialized APIs | Regional engines often maintain an edge in CER benchmarks |
⚠️ 6. Weaknesses to consider
Scribe v2: Processing speed (34x) is slower than Deepgram. Credit-to-hour consumption varies, so we recommend monitoring your dashboard as you scale.
OpenAI: No free API tier; the newest models lack word-level timestamps; real-time pricing via tokens makes forecasting costs difficult.
Deepgram: Lower accuracy (5.2% AA-WER) than Scribe v2 in independent benchmarks; Korean support arrived comparatively late (Nova-3, 2026).
Whisper (Open-source): No major updates since large-v3-turbo (Oct 2024); you are solely responsible for diarization and infrastructure maintenance.
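To give a sense of what "doing diarization yourself" involves, here is a minimal sketch of the merge step: labeling each transcribed word with the speaker turn it overlaps most. It makes no calls to Whisper or `pyannote`; the inputs are hand-written stand-ins, and the data shapes (word dicts with `text`, `start`, `end`, and speaker turns as `(start, end, label)` tuples) are assumptions you would need to adapt to your actual pipeline output.

```python
def assign_speakers(words: list[dict], turns: list[tuple]) -> list[dict]:
    """Label each word with the speaker turn it overlaps the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the time interval shared by the word and the turn.
            overlap = min(word["end"], turn_end) - max(word["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**word, "speaker": best_speaker})
    return labeled


def group_by_speaker(labeled_words: list[dict]) -> list[str]:
    """Collapse consecutive words from the same speaker into transcript lines."""
    lines = []
    for word in labeled_words:
        if lines and lines[-1][0] == word["speaker"]:
            lines[-1][1].append(word["text"])
        else:
            lines.append((word["speaker"], [word["text"]]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in lines]


# Hypothetical outputs from a transcription pass and a diarization pass.
words = [
    {"text": "Shall", "start": 0.0, "end": 0.3},
    {"text": "we", "start": 0.3, "end": 0.5},
    {"text": "start?", "start": 0.5, "end": 1.0},
    {"text": "Yes,", "start": 1.4, "end": 1.7},
    {"text": "please.", "start": 1.7, "end": 2.2},
]
turns = [(0.0, 1.2, "SPEAKER_00"), (1.2, 2.5, "SPEAKER_01")]

transcript = group_by_speaker(assign_speakers(words, turns))
print("\n".join(transcript))
# SPEAKER_00: Shall we start?
# SPEAKER_01: Yes, please.
```

Overlapping speech and words that straddle a turn boundary are where this simple rule breaks down, which is part of the maintenance burden a managed API takes off your plate.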
🚀 Final thoughts — The value of A/B testing
Benchmarks are merely references. Performance with your specific audio—considering recording quality, accents, and unique jargon—must be verified through real-world testing. With the ElevenLabs Free plan and Deepgram’s $200 credit, you can perform head-to-head comparisons using your own files at zero cost.
If you're new to integrating APIs, check out our Voice AI API integration guide, or for TTS comparisons, refer to our ElevenLabs vs Google TTS vs Amazon Polly comparison.
Get started with ElevenLabs Scribe (Free) →
ElevenLabs Lab signing off. ⚡