We built our new streaming TTS API for developers who need consistent, humanlike speech at low latency.
High latency kills conversation
Every 100ms matters in live conversation. Traditional TTS systems often introduce a 1 to 3 second delay, which is fine for static content but terrible for voice agents.
In demos or batch processing you might hear ultra-realistic voices, but in low-latency environments, where responses need to arrive in under 300 ms, vendors often switch those same voices to smaller, faster models and quality drops.
We have seen this first-hand with several partners building voice agents: the voice that initially wowed stakeholders can disappoint when integrated into the agent.
Our approach to text-to-speech
We built our TTS from day one with real-time streaming in mind.
The preview delivers speech with latency under 150 milliseconds without sacrificing naturalness, so you get fluid, human-like voices and the snappy response time needed for interactive applications.
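The latency that matters for conversation is time-to-first-audio: how long the listener waits before the voice starts. As a rough illustration (not the actual API), it can be measured like this in Python; `fake_tts_stream` is a hypothetical stand-in for a real streaming response:

```python
import time

def time_to_first_chunk(stream):
    """Return the first audio chunk and the elapsed time (in ms) until it arrived.

    `stream` is any iterator that yields raw audio bytes, e.g. a streaming
    TTS response. Time-to-first-chunk is the delay a listener perceives.
    """
    start = time.monotonic()
    first = next(stream)            # blocks until the first chunk arrives
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return first, elapsed_ms

def fake_tts_stream():
    """Hypothetical stand-in for a streaming TTS response."""
    for _ in range(10):
        yield b"\x00" * 640         # 20 ms of 16 kHz, 16-bit mono audio

chunk, latency_ms = time_to_first_chunk(fake_tts_stream())
print(f"first chunk of {len(chunk)} bytes after {latency_ms:.1f} ms")
```

Pointing the same measurement at a real streaming endpoint shows whether a vendor's quoted latency holds up in your own network conditions.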
Beyond English: Tackling robotic voices in other languages
Another big reason we built our own TTS is the language gap in voice synthesis.
We kept hearing the same complaint from clients about existing vendors' voices in languages like Dutch, French, and Thai.
One-size-fits-all systems, older techniques, or limited data for smaller languages lead to flat intonation and choppy cadence.
Making multilingual TTS more human
Speechmatics has a strong legacy in multilingual speech (our speech-to-text supports 55+ languages with high accuracy), and we’re bringing that expertise to TTS.
Our preview delivers a highly natural English voice now, and we’re actively working on more languages.
Our goal is to provide authentic voices in each language, capturing the nuances of local accents and speech patterns.
Building truly natural voices beyond English is a challenge, but we have a decade of experience building speech tech from the ground up.
More languages are coming, and we will not be satisfied until “robotic” is a thing of the past for all of them.
The hidden cost of talking at scale
TTS APIs charge per character or per minute of audio, and those costs add up quickly.
We’ve seen organizations hesitate to add voice everywhere or limit spoken content due to cost.
At millions of sentences or hours of speech, the bill becomes a real obstacle to scaling and experimentation.
Keeping TTS costs grounded
We want TTS to be cost-effective, so you do not have to think twice about volume. By leveraging our own models and infrastructure, and being thoughtful about model size and optimization, we aim to keep costs reasonable and predictable.
We also simplify pricing and packaging alongside our speech-to-text: one contract, one platform. This simplifies integration, budgeting, and support.
Many customers need on-prem or edge TTS for sensitive use cases (think healthcare, finance) or connectivity reasons.
Our TTS engine can be used in our cloud, your cloud, or on your own servers next to our STT engine, so you can keep voice data in-house and meet latency requirements without sacrificing quality.
It slots into existing workflows, from cloud API prototypes to containerized models on thousands of devices worldwide.
Built on a decade of speech experience (and why that matters)
We are not starting from scratch. Speechmatics has spent over 10 years at the bleeding edge of speech technology, primarily in speech-to-text. TTS and STT are two sides of the same coin.
Our expertise in acoustic modelling, pronunciation, prosody, and handling different accents and noise conditions feeds directly into generating realistic speech. We are pouring that knowledge into our TTS models.
As one of the leading speech recognition companies, we have solved hard problems in multilingual audio and will now leverage that foundation to make synthesized speech inclusive and accurate.
Our mission has always been to understand every voice. Adding TTS was a natural step toward that mission. We built this preview to address real-world problems we saw again and again, expanding on our past R&D in speech. Imagine voice agents that truly sound human and respond in real time, in any language.
For developers, our API documentation shows how to integrate streaming TTS with just a few lines of code.
Spin up a voice agent demo, plug it into your call center software, or build that talking IoT device you have been imagining.
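As a rough sketch of what a streaming integration looks like, the client below consumes audio chunk by chunk instead of waiting for the full utterance. The endpoint URL, header, and request fields here are placeholders, not the real API; `post` is injected so the same code works with `requests.post` or a test double:

```python
def stream_tts(post, text, voice="en-female-1", api_key="YOUR_API_KEY"):
    """Yield raw audio chunks from a hypothetical streaming TTS endpoint.

    `post` is any callable with a requests.post-like signature whose return
    value exposes .iter_content(); chunks can be fed to an audio player as
    they arrive, which is what keeps time-to-first-audio low.
    """
    resp = post(
        "https://api.example.com/v1/tts/stream",   # placeholder URL
        headers={"Authorization": f"Bearer {api_key}"},
        json={"text": text, "voice": voice, "sample_rate": 16000},
        stream=True,
    )
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:                                  # skip keep-alive chunks
            yield chunk
```

With `requests` installed, passing `post=requests.post` streams audio from whatever endpoint you point it at; see the API documentation for the actual URL, request fields, and voice names.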
Where to use it today:
Real-time support agents that offer natural, clear speech
Healthcare, where private, offline speech is important
Live translation of events, media or conversations where speed is essential
These are just a few ideas.
We would love to see what you build.
Because this is a preview, we are actively seeking your feedback.
Does the voice quality meet your expectations?
How is the latency in your setup?
Are there languages or voice styles you want?
Let us know.
This is our chance to iterate with you and ensure the full product checks the right boxes. Voice interfaces are entering a new era of natural, real-time interaction, and we want to be the foundation you build on.