There are dozens of public comparison datasets, universally accepted accuracy metrics, and no shortage of academic papers on ASR, so you might think that identifying the best-performing transcription API would be easy. But like many things in life, details matter, and most benchmarks are based on clean datasets that don't reflect real-world conditions.
Deepgram claims in its own article on the best speech-to-text APIs that its transcription model, Nova-3, has a 5.26% WER. Yet, when tested against more challenging and realistic ASR datasets, Nova-3 showed a WER as high as 28.1% (with Earnings-22).
This highlights a major problem with trying to use ASR benchmarking to predict real-world outcomes: most transcription providers benchmark on LibriSpeech (LS).
I’ll get into why that’s an issue below, but first, let’s compare Deepgram’s real benchmarks against those of our own model, Modulate.
Setting up the experiment: Testing Modulate and Deepgram with real-world audio
Most datasets aren’t robust enough to provide true benchmarks, and no one benchmark can tell the complete story, which is why we tested Deepgram (Nova-2 and Nova-3) and Modulate with three independent benchmark datasets:
- AMI Meeting Corpus
- VoxPopuli
- Earnings-22
AMI Meeting Corpus
AMI Meeting Corpus was created by the University of Edinburgh, along with a few collaborators, as part of an Augmented Multi-party Interaction (AMI) project. It’s one of the most challenging and realistic ASR benchmarks available because it consists of multi-speaker meetings, standups, and collaborative calls.
There is overlapping speech, interruptions, crosstalk, real background noise (typing, paper rustling, room acoustics), and natural conversational dynamics, making it a sort of "gold standard" for real-world conversational ASR.
VoxPopuli
VoxPopuli is a multilingual corpus of parliamentary speeches, which may not sound as reliable for benchmarking as a dataset like AMI.
However, due to its long-form, structured, single-speaker speech that still includes varied accents, room acoustics, and microphone setups, it is still helpful for testing how well a model performs on formal, high-quality speech.
Together, AMI and VoxPopuli help set a spectrum of performance for the two models.
Earnings-22
Earnings-22 is a collection of earnings calls with long-form (30-60 minutes), domain‑specific vocabulary, overlapping speakers from both the executive and analyst side, and noisy conditions. The audio quality is variable, depending on the individual phone lines and conferencing systems.
Translation: it’s one of the best ways to test how models handle sensitive conversations that include heavy financial jargon and require a lot of long context.
It's that long-context reasoning, realistic call-center-style audio, and specialized terminology that make this dataset a fairly common one in robust ASR research, particularly for models intended for contact centers, sales, and customer support.
WER: How ASR companies benchmark models
WER (Word Error Rate) measures how well an ASR system transcribes audio by counting the following three errors relative to the total number of words in the correct transcript:
- Insertions
- Deletions
- Substitutions
WER is calculated with this formula:

WER = (Substitutions + Deletions + Insertions) / Total words in the reference transcript
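In practice, WER is computed with a word-level edit distance between the reference and the hypothesis. A minimal sketch (the function name and structure here are illustrative, not any provider's implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance.

    WER = (substitutions + deletions + insertions) / reference word count
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the model inserts more words than the reference contains.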
Results of testing Modulate and Deepgram across three challenging benchmarks
We know that, in a clean test, Nova-3 has a batch WER of 5.26%, right?
No matter which model you test… if you're using a clean dataset, you're going to get a WER below 10%. Once you add the level of difficulty found in the datasets we used, you see errors increase by an average of 10% for most models.
That’s what makes the following data rather impressive.
Overall performance results
Both ASR models show a marked increase in errors against these datasets versus clean datasets like LibriSpeech. However, we can see that Deepgram's error rate jumped from 5.26% on clean data to 28.1% with AMI.
| Benchmark | Description | Modulate WER | Nova-3 WER | Gap / Advantage |
| --- | --- | --- | --- | --- |
| Earnings-22 | Long-form, noisy, domain-specific audio | 7.5% | 15.7% | ~2× better |
| VoxPopuli | Clean parliamentary speech | 8.0% | 8.2% | Nearly identical |
| AMI | Multi-speaker, overlapping, noisy | 14.9% | 28.1% | ~2× better |
There's a clear pattern: Modulate's API consistently delivers lower error rates, and both models' errors scale similarly as difficulty increases. With such dramatic differences in absolute accuracy, it's worth asking what's driving the gap.
Is the length of the audio segments driving the gap?
Most ASR models are pre-trained on audio segments of around 30 seconds, and most datasets split their audio into segments of roughly that length, as can be seen below, with the exception of AMI.
| Dataset | How Segments Are Defined | Typical Segment Length |
| --- | --- | --- |
| Earnings-22 | Full calls + optional chunked version | ~30s |
| VoxPopuli | Segmented by alignment scripts | ~15s to ~30s |
| AMI | Utterance-based segmentation | ~5s to ~120s (largely under 60s) |
With these considerations, it doesn't appear that the length of the audio segments has any impact on the outcome. So we must consider that it is, instead, the difficulty of the audio itself and the challenges of calculating WER.
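If you do need to chunk long recordings into the ~30-second windows most models were pre-trained on, the boundary math is simple. A small sketch (the function name and window default are illustrative):

```python
def segment_bounds(total_seconds: float, window: float = 30.0):
    """Yield (start, end) second offsets for fixed-length windows.

    The final segment may be shorter than `window`.
    """
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        yield (start, end)
        start = end
```

Real pipelines typically cut on silence or utterance boundaries near these offsets rather than mid-word, but the fixed-window version above shows the basic idea.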
The challenges with calculating WER
It's important to remember that not all errors are the same. Capitalization, punctuation, and spelling differences may be weighted differently or improperly, impacting the overall WER regardless of how technically correct the transcription is.
For instance, the transcript might use the British spelling of a word (e.g., "centre" or "analyse"), or might include the numeral rather than spelling the number out ("15" instead of "fifteen").
There are also conversation-specific markers that impact the accuracy of a transcription, including:
- Non‑lexical vocalizations: Short sounds like “mmh,” “mhm,” or “uh‑huh” that signal agreement, hesitation, or active listening.
- Audible events: Background or bodily sounds such as coughing, groaning, throat‑clearing, or sneezing that occur during speech.
- Prosodic markers: Symbols used to indicate changes in pitch or intonation, such as a rise, fall, or mid‑tone shift in the speaker’s voice.
- Speech‑termination markers (or disfluency terminators): Indicators of trailing off, interruptions, or rising intonation that signal incomplete or questioning utterances.
One of the reasons we tested Modulate's transcription API and Deepgram against AMI, VoxPopuli, and Earnings-22 is the prevalence of those conversation-specific markers.
They pose a real challenge to most ASR systems, increasing the number of transcription errors and making these datasets a truer representation of how well those systems hold up in real-world scenarios.
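To keep spelling variants, numerals, and fillers from inflating WER unfairly, evaluations usually normalize both the reference and the hypothesis before scoring. A minimal sketch, where the mapping tables are deliberately tiny examples (real pipelines use much fuller normalization tables):

```python
import re

# Illustrative maps only; production normalizers cover far more cases.
SPELLING = {"analyse": "analyze", "centre": "center"}
NUMBERS = {"fifteen": "15", "thirty": "30"}
FILLERS = {"mmh", "mhm", "uh-huh", "umm", "uhh", "uh", "um"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, unify spellings/numbers, drop fillers."""
    words = re.sub(r"[^\w\s'-]", "", text.lower()).split()
    out = []
    for w in words:
        if w in FILLERS:
            continue  # non-lexical vocalizations shouldn't count as errors
        w = SPELLING.get(w, w)
        w = NUMBERS.get(w, w)
        out.append(w)
    return " ".join(out)
```

Applying the same normalizer to both sides means the remaining errors reflect what the model actually misheard, not formatting conventions.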
Earnings-22 results
Earnings-22 calls are long and noisy, but also filled with domain-specific vocabulary. The nature of this material lends itself to a greater number of conversational markers. For this reason, it’s one of the most challenging datasets.
VoxPopuli
As the cleanest and most structured dataset of the three, featuring just one speaker, VoxPopuli produced roughly the same performance from both models.
Of the three, this dataset also clearly contained the fewest conversational markers, and therefore posed less of a challenge.
AMI Meeting Corpus
AMI is notoriously difficult due to the multiple speakers, whose voices often overlap, as well as the sheer number of conversational markers. It is for these reasons that we see the largest gap between Deepgram and Modulate, with Modulate posting half the error rate of its competitor at 14.9% (vs. 28.1%).
True cost comparison (not just price-per-minute)
With accuracy on Modulate's side, it's also worth noting that it comes at a better price than its legacy counterpart. That's true not just for base transcription price, which is what most model comparisons focus on, but for real cost, batch all-in, and streaming all-in.
Batch and streaming all-in
Batch all‑in refers to the total cost per hour when you process pre‑recorded audio (not live) and include every feature you actually need to run ASR in production.
Streaming all‑in refers to the total cost per hour when you process live audio (real‑time or near‑real‑time) with all required features included.
When you compare the two, you see that Modulate clearly provides greater accuracy across challenging datasets at a lower cost for both batch and streaming all-in:
Modulate comes in at just $0.03/hr vs. Deepgram's $0.38/hr (a ~12x gap) for batch all-in, and $0.06/hr vs. $0.58/hr (a ~10x gap) for streaming.
Real cost comparison
Real cost = transcription + diarization + redaction + intelligence features
Deepgram charges $0.12/hr extra for redaction and diarization. Modulate includes both features for free, while also providing emotional detection (20+ emotions) and accent detection (20+ accents), features that Deepgram doesn’t offer at any price point.
Deepgram’s pricing model looks manageable at small volumes, but as soon as you scale, the diarization costs quickly multiply. If using the ASR for fraud detection, Deepgram will also require you to build your own emotion/accent classifiers.
For teams running millions of minutes per month, those costs balloon quickly.
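To make that concrete, here's a quick back-of-the-envelope calculation using the batch all-in rates quoted above. The 1M-minutes-per-month volume is an illustrative assumption, not a figure from either vendor:

```python
def monthly_cost(hourly_rate: float, minutes_per_month: float) -> float:
    """All-in monthly spend in dollars, given a $/hr rate and audio volume."""
    return hourly_rate * minutes_per_month / 60

# Batch all-in rates from the comparison above ($/hr).
MODULATE_BATCH = 0.03
DEEPGRAM_BATCH = 0.38

# Illustrative volume: a team processing 1M minutes of audio per month.
volume = 1_000_000
modulate_spend = monthly_cost(MODULATE_BATCH, volume)  # roughly $500/month
deepgram_spend = monthly_cost(DEEPGRAM_BATCH, volume)  # roughly $6,333/month
```

At this volume, the per-hour gap compounds into thousands of dollars per month before any add-on features are counted.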
Modulate is the more budget-friendly option, including everything in its base rate.
The feature gap (what you’re getting… and not)
Deepgram is one of the most prominent ASRs because it delivers solid speech-to-text. But that’s where it ends.
Additional features that might be necessary for use by call centers, sales teams, and customer support conversations, especially in the case of fraud detection, include:
- Emotion detection
- Accent classification
- Higher-order behavioral signals
- Redaction
- Diarization
- Deepfake detection
Let’s look at this side-by-side.
| Feature | Modulate | Deepgram |
| --- | --- | --- |
| Transcription | Included | Included |
| Emotion Detection | Included | Not available |
| Accent Detection | Included | Not available |
| PII Redaction | Included | Included |
| Speaker Diarization | Included | Available, +$0.12/hr |
| Deepfake Detection | Included | Not available |
| All Features Included in Base Price | Yes | No |
Deepgram is fundamentally an ASR engine, meaning it can only return text. Modulate is a conversation‑understanding engine.
When you need a basic transcription ASR, this difference doesn’t have as large an impact. However, contact centers looking to flag frustration, confusion, or escalation risk in real-time might favor Modulate.
The addition of emotional context and speaker state also opens Modulate up for use in agent coaching and QA automation. Its vocal stress detection, deception cues, and accent patterns make fraud detection an easy task.
Clean ASR benchmarks fail to capture real-world performance
Most benchmark comparisons are based on the LibriSpeech dataset because it is free, scalable, and clean. But because it contains around 1,000 hours of 16 kHz read English speech from LibriVox audiobooks, it’s… too clean.
The audio is read, not conversational. It doesn’t contain accents or emotional variance. There isn’t any background noise. All of the markers you’d expect in real, live conversations with customers are non-existent in the LibriSpeech dataset.
Naturally, current ASR models perform well on clean datasets, which isn't limited to LS, but extends to other frequently used datasets like Common Voice (CV) and FLEURS as well. Put those models against real-world datasets like TalkBank, and you see error rates jump, as we did with Deepgram's Nova-3.
Even TalkBank has its limitations, though. It consists of fifteen datasets collected and recorded by different researchers with different goals, and that inconsistency is just one of its issues.
More realistic benchmarks come from testing models against harder datasets like the AMI Meeting Corpus, VoxPopuli, and Earnings-22, which are what we used to test both Modulate and Deepgram.
How to run this benchmark yourself
While benchmarks help you to get a basic understanding of how an ASR system might perform, there’s no match for testing against your own production audio.
Here’s a practical framework for benchmarking the ASRs you’re vetting with your own audio:
Step 1: Collect audio samples
Most public datasets consist of well over 100 audio samples. It might not be feasible to collect that many on your own, but aim for 50 to 100 samples from your own recordings database.
Ensure that these samples contain:
- Accents
- Specialized vocabulary
- Background noise
- More than one speaker
- Conversational markers (interruptions, umm’s and uhh’s)
Each clip should be at least 30 to 60 seconds long for any real evaluation. If you're just getting one sentence in the clip, you're going to lack the conversational aspect needed.
Step 2: Run through each provider's API
If you're testing multiple APIs, keep everything the same across the board. Use the same audio files, the same sampling rate, chunking strategy, and diarization/redaction settings. Essentially, you want to create your own dataset that you can use again and again without modifications.
You want to compare model performance, after all, not differences in preprocessing.
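A simple harness helps enforce that discipline. This is a sketch with hypothetical provider callables (`run_benchmark` and `SETTINGS` are names invented here; in practice each callable would wrap a vendor's real SDK or REST endpoint):

```python
# Settings applied identically to every provider under test.
SETTINGS = {
    "sample_rate": 16000,  # Hz, same for all providers
    "diarize": True,
    "redact": False,
}

def run_benchmark(audio_files, providers):
    """Run every audio file through every provider with identical settings.

    `providers` maps a provider name to a callable taking
    (audio_path, settings) and returning a transcript string.
    """
    results = {}
    for name, transcribe in providers.items():
        results[name] = {path: transcribe(path, SETTINGS) for path in audio_files}
    return results
```

Because every provider sees the same files and the same settings dict, any difference in the resulting transcripts comes from the models themselves.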
Step 3: Calculate WER against human-verified transcripts
To get an accurate measure, you need a ground-truth transcript for each clip. That means a human transcriber or some other kind of double-pass verification process. If you already have QA-verified transcripts, you can also use those and save yourself the time and effort.
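When you score a whole test set, micro-average the WER (total errors divided by total reference words) rather than averaging per-clip WERs, so long clips aren't underweighted. A self-contained sketch (function names are illustrative):

```python
def word_errors(ref_words, hyp_words):
    """Minimum edit distance (substitutions/insertions/deletions) between word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

def corpus_wer(pairs):
    """Micro-averaged WER over (reference, hypothesis) string pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        errors += word_errors(r, h)
        words += len(r)
    return errors / words
```

Feed it the human-verified transcripts as references and each provider's output as hypotheses, after applying the same normalization to both sides.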
Step 4: Price out total cost with all the features you need
As I’ve pointed out throughout this piece, accuracy is only half the story. You also need to calculate the true cost, considering:
- Base transcription price
- Diarization
- Redaction
- Emotion detection
- Accent detection
- Latency requirements
- Batch/streaming pricing
As we’ve discovered in this experiment, benchmarks with datasets that are too clean don’t give us accurate performance insights. Testing your own audio is one of the best ways to ensure you’re getting an accurate performance benchmark for the APIs you’re evaluating.
If you’re looking for a reproducible benchmarking script (and an example API request), we have it here for you.
