Apr 12, 2023 | Read time 7 min

Boosting sample efficiency through Self-Supervised Learning

Our latest Ursa release achieved incredible accuracy partly by scaling self-supervised learning. In this blog post we demonstrate the power of self-supervised learning and challenge the assumption that scaling labeled data is the key to greater accuracy. We show that with 300x less labeled data we still beat the nearest vendor by 12% relative.
Self-Supervised Learning
Bethan Thomas, Senior Machine Learning Engineer
Sample Efficiency

Figure 1: Left: a simplified diagram of a traditional ASR system mapping input speech features directly to output labels. Right: Intermediate layer representations from the SSL model are fed into the acoustic model as input. The final projection layer of the SSL model is ignored.
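The right-hand side of Figure 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Ursa implementation: `ToySSLEncoder`, `ToyAcousticModel`, `acoustic_model_inputs`, the layer sizes, and the choice of `layer_index=2` are all hypothetical, and random weights stand in for a pretrained model. The point it shows is only the data flow: the acoustic model consumes an intermediate hidden state, and the SSL model's final projection is never applied.

```python
import numpy as np

rng = np.random.default_rng(0)


class ToySSLEncoder:
    """Hypothetical stand-in for a pretrained SSL model:
    a stack of layers followed by a final projection layer."""

    def __init__(self, feat_dim=40, hidden_dim=64, num_layers=4, proj_dim=16):
        dims = [feat_dim] + [hidden_dim] * num_layers
        self.layers = [
            rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
            for d_in, d_out in zip(dims[:-1], dims[1:])
        ]
        # Used only by the SSL pretraining objective; ignored downstream.
        self.projection = rng.standard_normal((hidden_dim, proj_dim)) / np.sqrt(hidden_dim)

    def hidden_states(self, features):
        """Return the output of every layer, one array per layer."""
        states = []
        x = features
        for weights in self.layers:
            x = np.tanh(x @ weights)
            states.append(x)
        return states


class ToyAcousticModel:
    """Hypothetical acoustic model: a single linear map to per-frame label scores."""

    def __init__(self, input_dim=64, num_labels=30):
        self.weights = rng.standard_normal((input_dim, num_labels)) / np.sqrt(input_dim)

    def logits(self, representations):
        return representations @ self.weights


def acoustic_model_inputs(encoder, features, layer_index):
    """Pick one intermediate layer's representations; the projection is skipped."""
    return encoder.hidden_states(features)[layer_index]


# 100 frames of 40-dimensional speech features (random placeholder data).
speech_features = rng.standard_normal((100, 40))
encoder = ToySSLEncoder()
representations = acoustic_model_inputs(encoder, speech_features, layer_index=2)
frame_logits = ToyAcousticModel().logits(representations)
print(representations.shape, frame_logits.shape)
```

Which intermediate layer works best is an empirical question; the index here is arbitrary.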

Figure 2: The word error rate (WER) of two SSL models with different parameter counts. The plot shows how WER varies with the amount of labeled training data. For the larger model, the rate of improvement is slower, showing the diminishing returns of training on more labeled data with a more powerful SSL model. The absolute difference shows that the larger model generally performs better, even as the amount of labeled data decreases drastically.
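For readers unfamiliar with the metric in Figure 2, here is a minimal sketch of the standard WER calculation: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names are illustrative. The `relative_wer_reduction` helper shows one common way to express a relative comparison between two systems; whether the figures quoted in this post use exactly that definition is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Rolling-row Levenshtein distance over words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current
    return previous[-1] / len(ref_words)


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer (assumed definition)."""
    return (baseline_wer - new_wer) / baseline_wer


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

Production scoring usually normalizes text first (casing, punctuation, number formatting), which this sketch leaves out.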

References

[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[2] Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. "Listen, attend and spell." arXiv preprint arXiv:1508.01211 (2015).

[3] Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376 (2006).

Authors: Bethan Thomas
Acknowledgements: Benedetta Cevoli, John Hughes, Will Williams