Ursa provides the world’s most accurate speech-to-text and is a breakthrough in accessibility, reducing the digital divide for voices that other speech-to-text systems struggle to recognize. We’ve been blown away by how accurately Ursa can transcribe challenging speech - in particular, speech that would have been impossible for a human to grasp without first reading the transcript. We found Ursa can transcribe singing very well too, even though we didn't specifically train for this use case. You may like to try to understand the following clips before revealing the transcripts.
What sets Ursa apart from other speech-to-text offerings is its exceptional accuracy. Moving to GPUs for inference and scaling up our models has allowed Ursa’s enhanced model to surpass human-level transcription accuracy on the Kincaid46[1] dataset† and remove an additional 1 in 5 errors on average compared to Microsoft, the nearest large cloud vendor. Both Ursa’s standard and enhanced English models outperform all other vendors, delivering a significant 35% and 22% relative improvement respectively, compared to our previous release (shown in Table 1).
Ursa-quality transcription is also available for real-time recognition, leveraging the same underlying models. For the first time, we’re making GPU-accelerated transcription possible on-prem, with Ursa providing unrivaled accuracy and low total cost of ownership (TCO) to enterprises.
Additionally, we are proud to release new translation capabilities alongside our ground-breaking speech recognition. Together, these technologies break down language barriers and make a big leap towards our goal of understanding every voice.
Our Approach
We first train a self-supervised learning (SSL) model using over a million hours of unlabeled audio across 48 languages. This uses an efficient transformer variant that learns rich acoustic representations of speech (internally we name these models after bears, so we thought it was only fitting to call our release 'Ursa'). We then use paired audio-transcript data in a second stage to train an acoustic model that learns to map self-supervised representations to phoneme probabilities. The predicted phonemes are then mapped into a transcript by using a large language model to identify the most likely sequence of words.
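To make the structure of this pipeline concrete, here is a minimal sketch of the two stages described above. The module names and dimensions are hypothetical and purely illustrative, not our production implementation:

```python
# Sketch of the pipeline: an SSL encoder produces acoustic representations,
# an acoustic model maps them to phoneme probabilities, and a language model
# would then pick the most likely word sequence. All sizes are toy values.
import torch
import torch.nn as nn

class SSLEncoder(nn.Module):
    """Stand-in for the self-supervised transformer encoder."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):                  # feats: (batch, time, n_mels)
        return self.encoder(self.proj(feats))  # (batch, time, d_model)

class AcousticModel(nn.Module):
    """Maps SSL representations to per-frame phoneme log-probabilities."""
    def __init__(self, d_model=256, n_phonemes=60):
        super().__init__()
        self.head = nn.Linear(d_model, n_phonemes)

    def forward(self, reps):
        return self.head(reps).log_softmax(dim=-1)

def decode(phoneme_log_probs):
    """Placeholder for LM decoding: a real system searches over word
    sequences with a large language model; here we just take the argmax."""
    return phoneme_log_probs.argmax(dim=-1)

feats = torch.randn(1, 200, 80)                # ~2 s of log-mel features
phonemes = AcousticModel()(SSLEncoder()(feats))
print(decode(phonemes).shape)                  # (1, 200) frame-level phoneme ids
```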
Our diarization models exploit the same general self-supervised representations to enrich our transcripts with speaker information. We also apply inverse text normalization (ITN) models to render numerical entities in our transcriptions in a consistent and professional written form. Consistent ITN formatting is imperative when building applications that rely on dates, times, currencies, and contact information. Listen to the “Text Formatting” sample in the audio player above, which showcases our output, or read about it in our blog.
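As a toy illustration of the kind of mapping ITN performs (our production system uses trained ITN models, not hand-written rules like these):

```python
# Toy inverse text normalization: rewrite a few spoken-form entities into
# written form. Illustrative only; rules and coverage are hypothetical.
import re

def inverse_text_normalize(text: str) -> str:
    tens = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}
    units = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
             "six": 6, "seven": 7, "eight": 8, "nine": 9}

    def currency(match):
        # "twenty three dollars" -> "$23"
        return f"${tens[match.group(1)] + units[match.group(2)]}"

    text = re.sub(r"\b(twenty|thirty|forty|fifty) (one|two|three|four|five|"
                  r"six|seven|eight|nine) dollars\b", currency, text)
    # "three thirty pm" -> "3:30 PM"
    text = re.sub(r"\bthree thirty pm\b", "3:30 PM", text)
    return text

print(inverse_text_normalize("it cost twenty three dollars at three thirty pm"))
# -> "it cost $23 at 3:30 PM"
```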
We have made significant improvements at every stage of this pipeline, compounding to produce the accuracy gains shown in Table 1.
The Power of Scale
With Ursa, we achieved our breakthrough performance by scaling our SSL model by an order of magnitude to 2bn parameters and our language model by a factor of 30, both made possible by using GPUs for inference. GPUs have a highly parallel architecture that enables high-throughput inference, meaning more streams of audio can be processed in parallel.
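A minimal sketch of that idea, using a hypothetical stand-in model: many audio streams are padded into one batch and processed in a single forward pass on the GPU, which is where the throughput gain over stream-at-a-time inference comes from.

```python
# Illustrative batched GPU inference; the model and shapes are made up.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, 60)).to(device).eval()

# 32 audio streams of varying length, as (time, features) tensors.
streams = [torch.randn(torch.randint(100, 300, (1,)).item(), 80) for _ in range(32)]

# Pad to a common length and run all streams in one forward pass.
batch = nn.utils.rnn.pad_sequence(streams, batch_first=True).to(device)
with torch.no_grad():
    log_probs = model(batch).log_softmax(dim=-1)   # (32, max_time, 60)
print(log_probs.shape)
```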
Building on the findings from DeepMind’s Chinchilla paper[2], summer intern Andy Lo from the University of Cambridge established scaling laws for our SSL models and showed that these transformer-based audio models show similar scaling properties to large language models. By scaling to 2bn parameters‡, our models are now capable of learning richer acoustic features from unlabeled multi-lingual data, allowing us to understand a larger spectrum of voice cohorts.
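As a sketch of the methodology (with made-up data points rather than our measured curves), a Chinchilla-style power law can be fitted to (model size, validation loss) pairs like this:

```python
# Fit a scaling law of the form L(N) = a * N^(-alpha) + c to hypothetical
# (parameter count, SSL validation loss) points.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params, a, alpha, c):
    return a * n_params ** (-alpha) + c

n = np.array([1e8, 3e8, 6e8, 1e9, 2e9])          # model sizes (illustrative)
loss = np.array([2.10, 1.92, 1.83, 1.78, 1.71])  # validation losses (illustrative)

(a, alpha, c), _ = curve_fit(scaling_law, n, loss, p0=[100.0, 0.3, 1.0], maxfev=10000)
print(f"fitted exponent alpha = {alpha:.3f}")
print(f"predicted loss at 4B params: {scaling_law(4e9, a, alpha, c):.3f}")
```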
Not only do these representations boost accuracy when training an acoustic model, but they also vastly increase the sample efficiency of our training process, reducing the training time of our state-of-the-art English models from weeks to days. Crucially, we do not need hundreds of thousands of hours of labeled audio to make the step change in accuracy shown by Ursa. This enables us to obtain Whisper-level accuracy with just a few thousand hours of audio (200x less).
Measuring Ursa's ASR Accuracy
To ensure a comprehensive evaluation of our systems, we calculate the word error rate (WER)** across 14 short-form and 7 long-form open-source test sets, covering a wide range of domains such as audiobooks in LibriSpeech[3] and financial calls in Earnings-22[4]. Speaker diversity is also well covered, ranging from African American Vernacular English (AAVE) in CORAAL[5] to a global set of speakers in CommonVoice[6]. We take a weighted average based on the number of words in each test set. The averages quoted in Table 1 compare against the previous Speechmatics release, while Table 2 compares against other speech-to-text vendors.
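For illustration, the word-weighted average can be computed as below, using the open-source jiwer package for per-set WER; the tiny reference/hypothesis pairs are toy stand-ins for the 21 test sets.

```python
# Word-weighted average WER across test sets (toy data, illustrative only).
import jiwer

test_sets = {
    "librispeech_clean": (["the cat sat on the mat"], ["the cat sat on a mat"]),
    "earnings22": (["revenue grew ten percent year over year"],
                   ["revenue grew ten percent year on year"]),
}

total_errors, total_words = 0.0, 0
for name, (refs, hyps) in test_sets.items():
    wer = jiwer.wer(refs, hyps)                 # per-test-set WER
    n_words = sum(len(r.split()) for r in refs)
    total_errors += wer * n_words               # weight by reference word count
    total_words += n_words
    print(f"{name}: WER = {wer:.2%}")

print(f"weighted average WER = {total_errors / total_words:.2%}")
```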
Table 1: A comparison of average word error rate (WER) for Ursa’s enhanced and standard models versus our previous release. The WER is significantly lower for Ursa with a 35% and 22% relative improvement for the standard and enhanced models, respectively. This means that using our latest enhanced model eliminates 1 in 5 errors made by our previous model.
Ursa’s enhanced model has unrivaled accuracy compared to other speech-to-text vendors, with a 22% lead over Microsoft, the nearest large cloud provider (see Table 2). Meanwhile, Ursa’s standard model uses a smaller self-supervised model to achieve higher throughput and can be used when speed matters more. Even so, it provides a 10% accuracy gain over Microsoft, meaning you can still expect market-leading accuracy at this operating point.
Table 2: A comparison of the average word error rate (WER) for Ursa's enhanced model (denoted by Speechmatics) and other speech-to-text vendors on the market‡‡. Speechmatics shows significant relative accuracy gains, ranging from 22% versus Microsoft to 38% versus Google.
Recently, OpenAI conducted a study on human transcription[7] and used four human transcription services to transcribe the Kincaid46 test set, consisting of news broadcasts, podcasts, meetings, and phone calls. They found the WER ranged from 8.14% to 10.5%. Remarkably, Ursa’s enhanced model achieves a lower WER of 7.88%, surpassing human-level accuracy in this domain. Results can be replicated in this notebook. We’re excited to have come a long way since 2018 when this dataset was released and our WER was 23%!
Our New Translation API
Our translation offering allows you to translate between English and 34 languages. When combined with our accurate transcripts, users receive the best translation package for speech on the market. We find that when building such a composite system, the accuracy of the underlying speech-to-text system significantly improves translation performance, as shown by the higher BLEU[8] scores in Table 3 and the example output versus Google in Table 4. BLEU measures the similarity of the output to high-quality reference translations, and the higher BLEU scores achieved by Ursa show the power of high-quality transcription in downstream tasks. We expect this trend to continue.
| | WER ↓ | BLEU ↑ |
|---|---|---|
| Speechmatics | 8.9 | 33.83 |
| Google | 18.88 | 30.61 |
Table 3: Speechmatics scores a higher BLEU score compared to Google on the CoVoST2[9] dataset for translations from English to German. Speechmatics obtains a significantly lower word error rate (WER) than Google, aiding the translation service.
| | Transcript | Translation |
|---|---|---|
| Speechmatics | Did you give her the money? | Hast du ihr das Geld gegeben? |
| Google | Did you *keep hear* the money? | Hast du ihr ~~das~~ Geld *gehört*? |
Table 4: An example that shows how errors from speech recognition impact the accuracy of translation. Substitution errors are shown in italics and deletions are struck through.
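For reference, BLEU scores like those in Table 3 can be computed with the open-source sacreBLEU package; the sentences below are toy examples rather than CoVoST2 data.

```python
# Score translation hypotheses against reference translations with sacreBLEU.
import sacrebleu

hypotheses = ["Hast du ihr das Geld gegeben?",
              "Hast du ihr Geld gehört?"]          # pipeline outputs (toy)
references = [["Hast du ihr das Geld gegeben?",
               "Hast du ihr das Geld gegeben?"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```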
Best-in-Class
Ursa represents a quantum leap forward in speech technologies, setting a new standard for the speech-to-text industry. Our scaled-up self-supervised model, combined with the power of GPU-based computing, has allowed Speechmatics to achieve unmatched accuracy, speed, and downstream performance. Ursa is the clear choice for anyone seeking best-in-class speech recognition and translation, delivering on our promise to understand every voice.
Stay tuned for more exciting updates from Speechmatics as we continue to push hard to understand every voice. Next, we will share a deeper analysis of accuracy across different demographics, show how speech-to-text accuracy is essential for downstream tasks, give a detailed breakdown of translation accuracy, and examine how Ursa holds up in noisy conditions.