Ursa provides the world’s most accurate speech-to-text and is a breakthrough in accessibility, reducing the digital divide for voices that other speech-to-text systems struggle to recognize. We’ve been blown away by how accurately Ursa can transcribe challenging speech - in particular, speech that would have been impossible for a human to grasp without first reading the transcript. We found Ursa can transcribe singing very well too, even though we didn't specifically train for this use case. You may like to try to understand the following clips before revealing the transcripts.
What sets Ursa apart from other speech-to-text offerings is its exceptional accuracy. Moving to GPUs for inference and scaling up our models has allowed Ursa’s enhanced model to surpass human-level transcription accuracy on the Kincaid46[1] dataset† and remove an additional 1 in 5 errors on average compared to Microsoft, the nearest large cloud vendor. Both Ursa’s standard and enhanced English models outperform all other vendors, delivering a significant 35% and 22% relative improvement respectively, compared to our previous release (shown in Table 1).
Ursa-quality transcription is also available for real-time recognition, leveraging the same underlying models. For the first time, we’re making GPU-accelerated transcription possible on-prem, with Ursa providing unrivaled accuracy and low total cost of ownership (TCO) to enterprises.
Additionally, we are proud to release new translation capabilities alongside our ground-breaking speech recognition. Together, these technologies break down language barriers and make a big leap towards our goal of understanding every voice.
Our Approach
We first train a self-supervised learning (SSL) model using over a million hours of unlabeled audio across 48 languages. This uses an efficient transformer variant that learns rich acoustic representations of speech (internally we name these models after bears, so we thought it was only fitting to call our release 'Ursa'). We then use paired audio-transcript data in a second stage to train an acoustic model that learns to map self-supervised representations to phoneme probabilities. The predicted phonemes are then mapped into a transcript by using a large language model to identify the most likely sequence of words.
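To make the structure of this pipeline concrete, here is a minimal sketch of the two stages described above. The module names and dimensions are hypothetical and purely illustrative, not our production implementation:

```python
# Sketch of the pipeline: an SSL encoder produces acoustic representations,
# an acoustic model maps them to phoneme probabilities, and a language model
# would then pick the most likely word sequence. All sizes are toy values.
import torch
import torch.nn as nn

class SSLEncoder(nn.Module):
    """Stand-in for the self-supervised transformer encoder."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):                  # feats: (batch, time, n_mels)
        return self.encoder(self.proj(feats))  # (batch, time, d_model)

class AcousticModel(nn.Module):
    """Maps SSL representations to per-frame phoneme log-probabilities."""
    def __init__(self, d_model=256, n_phonemes=60):
        super().__init__()
        self.head = nn.Linear(d_model, n_phonemes)

    def forward(self, reps):
        return self.head(reps).log_softmax(dim=-1)

def decode(phoneme_log_probs):
    """Placeholder for LM decoding: a real system searches over word
    sequences with a large language model; here we just take the argmax."""
    return phoneme_log_probs.argmax(dim=-1)

feats = torch.randn(1, 200, 80)                # ~2 s of log-mel features
phonemes = AcousticModel()(SSLEncoder()(feats))
print(decode(phonemes).shape)                  # (1, 200) frame-level phoneme ids
```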
Our diarization models exploit the same general self-supervised representations to enrich our transcripts with speaker information. We also apply inverse text normalization (ITN) models to render numerical entities in our transcriptions in a consistent and professional written form. Consistent ITN formatting is imperative when building applications that rely on dates, times, currencies, and contact information. Listen to the “Text Formatting” sample in the audio player above, which showcases our output, or read about it in our blog.
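As a toy illustration of the kind of mapping ITN performs (our production system uses trained ITN models, not hand-written rules like these):

```python
# Toy inverse text normalization: rewrite a few spoken-form entities into
# written form. Illustrative only; rules and coverage are hypothetical.
import re

def inverse_text_normalize(text: str) -> str:
    tens = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}
    units = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
             "six": 6, "seven": 7, "eight": 8, "nine": 9}

    def currency(match):
        # "twenty three dollars" -> "$23"
        return f"${tens[match.group(1)] + units[match.group(2)]}"

    text = re.sub(r"\b(twenty|thirty|forty|fifty) (one|two|three|four|five|"
                  r"six|seven|eight|nine) dollars\b", currency, text)
    # "three thirty pm" -> "3:30 PM"
    text = re.sub(r"\bthree thirty pm\b", "3:30 PM", text)
    return text

print(inverse_text_normalize("it cost twenty three dollars at three thirty pm"))
# -> "it cost $23 at 3:30 PM"
```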
We have made significant improvements at every stage of this pipeline, compounding to produce the accuracy gains shown in Table 1.
The Power of Scale
With Ursa, we achieved our breakthrough performance by scaling our SSL model by an order of magnitude to 2bn parameters and our language model by a factor of 30, both made possible by using GPUs for inference. GPUs have a highly parallel architecture that enables high-throughput inference, meaning more streams of audio can be processed in parallel.
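A minimal sketch of that idea, using a hypothetical stand-in model: many audio streams are padded into one batch and processed in a single forward pass on the GPU, which is where the throughput gain over stream-at-a-time inference comes from.

```python
# Illustrative batched GPU inference; the model and shapes are made up.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, 60)).to(device).eval()

# 32 audio streams of varying length, as (time, features) tensors.
streams = [torch.randn(torch.randint(100, 300, (1,)).item(), 80) for _ in range(32)]

# Pad to a common length and run all streams in one forward pass.
batch = nn.utils.rnn.pad_sequence(streams, batch_first=True).to(device)
with torch.no_grad():
    log_probs = model(batch).log_softmax(dim=-1)   # (32, max_time, 60)
print(log_probs.shape)
```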
Building on the findings from DeepMind’s Chinchilla paper[2], summer intern Andy Lo from the University of Cambridge established scaling laws for our SSL models and showed that these transformer-based audio models show similar scaling properties to large language models. By scaling to 2bn parameters‡, our models are now capable of learning richer acoustic features from unlabeled multi-lingual data, allowing us to understand a larger spectrum of voice cohorts.
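As a sketch of the methodology (with made-up data points rather than our measured curves), a Chinchilla-style power law can be fitted to (model size, validation loss) pairs like this:

```python
# Fit a scaling law of the form L(N) = a * N^(-alpha) + c to hypothetical
# (parameter count, SSL validation loss) points.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params, a, alpha, c):
    return a * n_params ** (-alpha) + c

n = np.array([1e8, 3e8, 6e8, 1e9, 2e9])          # model sizes (illustrative)
loss = np.array([2.10, 1.92, 1.83, 1.78, 1.71])  # validation losses (illustrative)

(a, alpha, c), _ = curve_fit(scaling_law, n, loss, p0=[100.0, 0.3, 1.0], maxfev=10000)
print(f"fitted exponent alpha = {alpha:.3f}")
print(f"predicted loss at 4B params: {scaling_law(4e9, a, alpha, c):.3f}")
```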
Not only do these representations boost accuracy when training an acoustic model, but they also vastly increase the sample efficiency of our training process, reducing the training time of our state-of-the-art English models from weeks to days. Crucially, we do not need hundreds of thousands of hours of labeled audio to make the step change in accuracy shown by Ursa. This enables us to obtain Whisper-level accuracy with just a few thousand hours of audio (200x less).
Measuring Ursa's ASR Accuracy
To ensure a comprehensive evaluation of our systems, we calculate the word error rate (WER)** across 14 short-form and 7 long-form open-source test sets, covering a wide range of domains such as audiobooks in LibriSpeech[3] and financial calls in Earnings-22[4]. Speaker diversity is also well covered, ranging from African American Vernacular English (AAVE) in CORAAL[5] to a global set of speakers in CommonVoice[6]. We take a weighted average based on the number of words in each test set. The averages quoted in Table 1 compare against the previous Speechmatics release, while Table 2 compares against other speech-to-text vendors.
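For illustration, the word-weighted average can be computed as below, using the open-source jiwer package for per-set WER; the tiny reference/hypothesis pairs are toy stand-ins for the 21 test sets.

```python
# Word-weighted average WER across test sets (toy data, illustrative only).
import jiwer

test_sets = {
    "librispeech_clean": (["the cat sat on the mat"], ["the cat sat on a mat"]),
    "earnings22": (["revenue grew ten percent year over year"],
                   ["revenue grew ten percent year on year"]),
}

total_errors, total_words = 0.0, 0
for name, (refs, hyps) in test_sets.items():
    wer = jiwer.wer(refs, hyps)                 # per-test-set WER
    n_words = sum(len(r.split()) for r in refs)
    total_errors += wer * n_words               # weight by reference word count
    total_words += n_words
    print(f"{name}: WER = {wer:.2%}")

print(f"weighted average WER = {total_errors / total_words:.2%}")
```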
Table 1: A comparison of average word error rate (WER) for Ursa’s enhanced and standard models versus our previous release. The WER is significantly lower for Ursa with a 35% and 22% relative improvement for the standard and enhanced models, respectively. This means that using our latest enhanced model eliminates 1 in 5 errors made by our previous model.
Ursa’s enhanced model has unrivaled accuracy compared to other speech-to-text vendors, with a 22% lead over Microsoft, the nearest large cloud provider (see Table 2). Meanwhile, Ursa’s standard model uses a smaller self-supervised model to achieve higher throughput and can be used when speed matters more. Even so, it provides a 10% accuracy gain over Microsoft, meaning you can still expect market-leading accuracy at this operating point.
Table 2: A comparison of the average word error rate (WER) for Ursa's enhanced model (denoted by Speechmatics) and other speech-to-text vendors on the market‡‡. Speechmatics shows significant relative accuracy gains, ranging from 22% versus Microsoft to 38% versus Google.
Recently, OpenAI conducted a study on human transcription[7] and used four human transcription services to transcribe the Kincaid46 test set, consisting of news broadcasts, podcasts, meetings, and phone calls. They found the WER ranged from 8.14% to 10.5%. Remarkably, Ursa’s enhanced model achieves a lower WER of 7.88%, surpassing human-level accuracy in this domain. Results can be replicated in this notebook. We’re excited to have come a long way since 2018 when this dataset was released and our WER was 23%!
Our New Translation API
Our translation offering allows you to translate between English and 34 languages. When combined with our accurate transcripts, users receive the best translation package for speech on the market. We find that when building such a composite system, the accuracy of the underlying speech-to-text system significantly improves translation performance, as shown by the higher BLEU[8] scores in Table 3 and the example output versus Google in Table 4. BLEU measures the similarity of the output to high-quality reference translations, and the higher BLEU scores achieved by Ursa show the power of high-quality transcription in downstream tasks. We expect this trend to continue.
| | WER ↓ | BLEU ↑ |
|---|---|---|
| Speechmatics | 8.9 | 33.83 |
| Google | 18.88 | 30.61 |
Table 3: Speechmatics scores a higher BLEU score compared to Google on the CoVoST2[9] dataset for translations from English to German. Speechmatics obtains a significantly lower word error rate (WER) than Google, aiding the translation service.
| | Transcript | Translation |
|---|---|---|
| Speechmatics | Did you give her the money? | Hast du ihr das Geld gegeben? |
| Google | Did you *keep hear* the money? | Hast du ihr ~~das~~ Geld *gehört*? |
Table 4: An example that shows how errors from speech recognition impact the accuracy of translation. Substitution errors are shown in italics and deletions are struck through.
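For reference, BLEU scores like those in Table 3 can be computed with the open-source sacreBLEU package; the sentences below are toy examples rather than CoVoST2 data.

```python
# Score translation hypotheses against reference translations with sacreBLEU.
import sacrebleu

hypotheses = ["Hast du ihr das Geld gegeben?",
              "Hast du ihr Geld gehört?"]          # pipeline outputs (toy)
references = [["Hast du ihr das Geld gegeben?",
               "Hast du ihr das Geld gegeben?"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```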
Best-in-Class
Ursa represents a quantum leap forward in speech technologies, setting a new standard for the speech-to-text industry. Our scaled-up self-supervised model, combined with the power of GPU-based computing, has allowed Speechmatics to achieve unmatched accuracy, speed, and downstream performance. Ursa is the clear choice for anyone seeking best-in-class speech recognition and translation, delivering on our promise to understand every voice.
Stay tuned for more exciting updates from Speechmatics as we continue to push hard to understand every voice. Next, we will share a deeper analysis of accuracy across different demographics, show how speech-to-text accuracy is essential for downstream tasks, give a detailed breakdown of translation accuracy, and examine how Ursa holds up in noisy conditions.