ElevenLabs VS Amazon Polly
Published: 2023-10-07
Overview
This is part 2 of our in-depth evaluation of ElevenLabs. Last month we went over the ElevenLabs TTS engine, its capabilities, its limitations, and how it powers our text-to-video AI conversion process. Today we are putting it head-to-head with the incumbent 800-pound gorilla of the TTS space: Amazon Polly.
Some History
Amazon Polly was introduced in 2016. At the time, it was on the cutting edge of TTS capabilities. It supports over 30 languages, timing information accurate to the millisecond, and SSML syntax for fine control over things like reading speed and the phonetic pronunciation of individual words.
Over the years, large competitors like Google and Microsoft followed with very similar products and features. Eventually, these became the de facto standard voices in video and other applications.
Unfortunately, the quality of speech synthesis remained average. Almost any voice generated by these offerings sounds robotic, lacks human character, and is easy to identify as synthetic. A few years ago, driven in part by continuing developments in AI models, a new generation of small startups began to appear. They typically don’t offer the full breadth of features found in the Amazon, Google, or Microsoft products, but they have a much more realistic speech synthesis engine. So much so that many of these voices are very hard, if not impossible, to distinguish from real human speech.
Enter ElevenLabs
Below is a table comparing Amazon Polly and ElevenLabs:
Comparison | Amazon Polly | ElevenLabs |
---|---|---|
Quality | Average | Excellent |
Timing Information | Yes | No |
Support for SSML Syntax | Yes | No |
Number of Voices | 90+ | 40+ |
Support Custom Voices | No* | Yes |
Price | Very Cheap | Medium |
Advantages of ElevenLabs
Quality
The main advantage of ElevenLabs is, plain and simple, quality. The vast majority of generated speech sounds completely natural and is nearly impossible to tell apart from real human speech.
Quantity
While the total number of voices they provide is not that high, remember that most of those voices are independent of the language. Amazon Polly provides over 90 voices in total, but only about a dozen of those can speak English, and if you are looking at other languages, say Japanese, you are down to just a couple of options. With ElevenLabs, the majority of the 40+ voices will have no trouble speaking Japanese.
Custom voices
Another advantage of ElevenLabs is the ability to easily clone your own voice or generate new custom voices. To give credit to Amazon, they do offer the ability to create a “Brand Voice”, but that requires a custom engagement with them to build it and is not accessible to anyone but the largest companies.
Disadvantages of ElevenLabs
Speechmarks and Timing Information
Speechmarks is the name given to metadata about an audio file that provides exact timing, down to the millisecond, of every phonetic sound being spoken. This is really useful for generating audio waveform animations or for syncing subtitles and lip movements in video. It seems strange that ElevenLabs does not offer this today, especially considering that many of the voices are clearly targeting narration inside video games, but my guess is that they just have not prioritized it yet.
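To make the timing data concrete, here is a minimal sketch of turning word-level speech marks into subtitle cues. It assumes the line-delimited JSON shape that Amazon Polly returns when speech marks are requested (one object per line with `time`, `type`, and `value` fields). The function name `parse_word_marks`, the `final_word_ms` fallback, and the sample data are illustrative assumptions, not part of any vendor API.

```python
import json


def parse_word_marks(raw: str, final_word_ms: int = 400):
    """Turn line-delimited speech mark JSON into (word, start_ms, end_ms) cues."""
    marks = [json.loads(line) for line in raw.splitlines() if line.strip()]
    # Keep only word-level marks; sentence or viseme marks are ignored here.
    words = [m for m in marks if m.get("type") == "word"]
    cues = []
    for i, mark in enumerate(words):
        start = mark["time"]
        # A mark carries only a start time, so each cue ends where the next word
        # begins. The last word gets an assumed fixed duration (final_word_ms).
        end = words[i + 1]["time"] if i + 1 < len(words) else start + final_word_ms
        cues.append((mark["value"], start, end))
    return cues


# Hypothetical sample in the shape of Polly's speech mark output.
sample = "\n".join([
    '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
    '{"time":373,"type":"word","start":5,"end":8,"value":"had"}',
    '{"time":604,"type":"word","start":9,"end":10,"value":"a"}',
])

for word, start, end in parse_word_marks(sample):
    print(f"{start:>5} ms -> {end:>5} ms  {word}")
```

With cues like these, a subtitle renderer or waveform animation can highlight each word at the moment it is spoken, which is the piece that has no equivalent when the engine returns audio only.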
SSML
Speech Synthesis Markup Language (SSML) is commonly used by many TTS vendors. It lets the user specify detailed instructions about how the text is to be read. For example, you can specify the pronunciation for an emoji, add an extra pause between sentences, or tell the engine to emphasize a specific word. At the time of writing, ElevenLabs does not support SSML. While they do provide support for some simple things like adding an extra pause, the documentation is lacking, and at times it feels as if even they do not fully understand the capabilities of their own audio synthesis engine.
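As an illustration of the kind of control SSML gives, the sketch below builds a small SSML document with a pause between sentences and emphasis on one word. The `<speak>`, `<break>`, and `<emphasis>` elements come from the SSML standard, but the helper `build_ssml` and its parameters are hypothetical, and tag support varies by vendor and voice engine, so treat this as a sketch and check the provider's SSML reference before relying on any tag.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=500, emphasize=None):
    """Join sentences with a pause and optionally emphasize one word."""
    parts = []
    for sentence in sentences:
        # Escape &, < and > so plain text cannot break the markup.
        text = escape(sentence)
        if emphasize:
            safe = escape(emphasize)
            # Wrap only the first occurrence of the word in each sentence.
            text = text.replace(
                safe, f'<emphasis level="strong">{safe}</emphasis>', 1
            )
        parts.append(text)
    pause = f'<break time="{int(pause_ms)}ms"/>'
    return "<speak>" + pause.join(parts) + "</speak>"


ssml = build_ssml(
    ["Quality matters.", "Price matters too."],
    pause_ms=700,
    emphasize="Price",
)
print(ssml)
```

The resulting string is what you would send to an SSML-capable engine in place of plain text. With an engine that lacks SSML support, the same intent has to be approximated through punctuation or whatever pause syntax the vendor documents.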
Price
While cheaper than many competitors, ElevenLabs is not “cheap”. They offer a free tier for anyone interested in trying out the product. However, for any large-scale project, the price is going to be more than 10x that of Amazon Polly. Is it worth it? That is for you to decide.
Final Remarks
The field of speech synthesis has progressed by leaps and bounds in just a few short years. I expect prices to go down quickly as various firms catch up on quality and it becomes harder to differentiate. It will be interesting to see whether Amazon upgrades its Polly service or decides to acquire one of the competing startups in this space.