The Best TTS Engine Since Sliced Bread!
Published: 2023-10-07
ElevenLabs
In this post we look at the ElevenLabs text-to-speech (TTS) engine and its integration into KlipMaker, where it powers the speech synthesis in our text-to-video platform. In the next article we will compare ElevenLabs in more detail with Amazon Polly, an older-generation TTS engine from Amazon.
ElevenLabs is a recent entry into the market. The company was officially started in 2022, meaning it was not around when KlipMaker was originally implemented. ElevenLabs seems very well funded, which likely explains why it has quickly grown into a serious competitor in the space. The firm was founded with an “ultimate goal of instantly converting spoken audio between languages”. They are not quite there yet, but the quality of the voices is really good, and in many cases they are impossible to distinguish from real human speech.
ElevenLabs is taking a fundamentally different approach from its competitors, using its latest proprietary AI models to create a generative voice AI engine. Quite possibly an industry first! It shows: the voices have much more character and emotion than those of other TTS engines, and this sets ElevenLabs apart from competitors that always sound just a bit too “monotone” to be 100% natural.
ElevenLabs offers a simple web UI to generate speech. Users can select the voice, and there are a couple of parameters to tweak. Unlike some other services, it offers no way to control individual words or parts of the speech. Settings are always applied to the entire text that you submit for synthesis in one go.
Interestingly, the speech system is clever enough to automatically detect a foreign language and pronounce it correctly. Typically, other engines read foreign text with a very heavy American or English accent unless the user explicitly selects a voice that was trained on that language. That seems to be much less of an issue for ElevenLabs. So much so that there is no option to explicitly select a language. If you try typing some foreign text, say Japanese, most voices will sound perfectly natural reading it without you having to tell the system that this is Japanese text. This simplifies the user experience and means the product will be easier to use for things like dubbing, where we want the same voice to be able to speak all languages equally well.
API Integration:
What really puts ElevenLabs above the competition is the API. All the other services we have seen to date either do not offer one, offer it at a much higher price point, or produce audio that is not much better than that of previous-generation engines like Amazon Polly.
ElevenLabs provides a Python package and direct web endpoints for generating audio, getting a list of voices, and creating your own custom voice. Having already integrated the Amazon Polly TTS service into KlipMaker, we were able to quickly put together a prototype of the ElevenLabs TTS integration using Node.js.
In the example above, we make a simple request to the ElevenLabs API with the string of text that we want pronounced, voiceover_params.text. The service comes back with the audio data stream, response.rawBody. Interestingly, there doesn’t seem to be a way to have it tell you the duration of the generated speech, or give any other sort of timing information. This should be an easy feature to add. It would be especially useful for things like dubbing, where you want exact timing information for each word in order to sync it correctly with the video.
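The request described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not KlipMaker's actual code: it uses the fetch function built into Node.js 18+ (so the audio is read with arrayBuffer() rather than a rawBody property), it assumes the public POST /v1/text-to-speech/{voice_id} endpoint with an xi-api-key header, and the voice_id field, the voice_settings values, and the helper names are hypothetical. The last helper is only a rough workaround for the missing duration information, and it assumes constant-bitrate 128 kbps MP3 output.

```javascript
// Sketch only: endpoint, header, and body field names reflect our reading of
// the ElevenLabs REST API and should be checked against the current docs.
const ELEVENLABS_BASE_URL = "https://api.elevenlabs.io/v1";

// Build the HTTP request for one synthesis call. Kept separate from the
// network call so it can be inspected offline.
// voiceover_params.text is the string to be spoken (as in the article);
// voiceover_params.voice_id is a hypothetical field name.
function buildTtsRequest(voiceover_params, apiKey) {
  const voiceId = encodeURIComponent(voiceover_params.voice_id);
  return {
    url: `${ELEVENLABS_BASE_URL}/text-to-speech/${voiceId}`,
    options: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey,
        "Content-Type": "application/json",
        Accept: "audio/mpeg",
      },
      // Settings apply to the whole text; the values here are illustrative.
      body: JSON.stringify({
        text: voiceover_params.text,
        voice_settings: { stability: 0.5, similarity_boost: 0.75 },
      }),
    },
  };
}

// Send the request and return the generated audio as a Buffer.
async function synthesize(voiceover_params, apiKey) {
  const { url, options } = buildTtsRequest(voiceover_params, apiKey);
  const response = await fetch(url, options); // global fetch, Node.js 18+
  if (!response.ok) {
    throw new Error(`ElevenLabs request failed: HTTP ${response.status}`);
  }
  return Buffer.from(await response.arrayBuffer());
}

// Rough duration estimate in seconds. Only meaningful for constant-bitrate
// audio with negligible metadata; 128 kbps is an assumed default.
function estimateDurationSeconds(audio, bitrateKbps = 128) {
  return (audio.length * 8) / (bitrateKbps * 1000);
}
```

A call would look like `const audio = await synthesize({ text: "Hello there", voice_id: "<voice id>" }, process.env.ELEVENLABS_API_KEY)`, where the environment variable name is also just a placeholder. The estimate gives a total length at best; it does not recover the per-word timing that dubbing would need.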
Pricing
Currently ElevenLabs offers a free plan that is enough to get a feel for the engine and is likely sufficient for infrequent use. They also have a number of paid plans ranging from $5 to $330 per month. API access is included in all plans, unlike many other services that leave this option out of all but the most expensive plans. As of this month, the two lowest plans are heavily discounted for the first month.
ElevenLabs in KlipMaker:
To access the new voices in KlipMaker, simply specify them inside the script text area like so:
In the above example, the text will be read by the ElevenLabs voice named "Callum". All default ElevenLabs voices are currently supported. You can get a full list on their website. A cleaner, more intuitive interface is in the works, and will follow a few more exciting product features we currently have on our roadmap.
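For readers working with the API directly, a voice name such as "Callum" has to be resolved to a voice ID before audio can be requested. The sketch below shows one way this could be done. It assumes that GET /v1/voices returns an object with a voices array whose entries carry name and voice_id fields. The helper names are hypothetical, and this is not a description of how KlipMaker resolves voices internally.

```javascript
// Sketch only: assumes GET /v1/voices responds with
// { voices: [{ voice_id: "...", name: "...", ... }] }.
async function listVoices(apiKey) {
  const response = await fetch("https://api.elevenlabs.io/v1/voices", {
    headers: { "xi-api-key": apiKey },
  });
  if (!response.ok) {
    throw new Error(`Voice list request failed: HTTP ${response.status}`);
  }
  const data = await response.json();
  return data.voices;
}

// Case-insensitive lookup of a voice by its display name.
// Returns the matching voice object, or null if no voice has that name.
function findVoiceByName(voices, name) {
  const wanted = name.trim().toLowerCase();
  return voices.find((voice) => voice.name.toLowerCase() === wanted) ?? null;
}
```

With the list fetched once and cached, `findVoiceByName(voices, "Callum")` would return the entry whose voice_id can then be used in a synthesis request.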
As you might have guessed from the title of this post, we are very excited about ElevenLabs. We can't wait for the day when we will be able to watch a film from any country and have all the voices sound perfectly natural in English. In the meantime, the addition of ElevenLabs is a great upgrade in our journey to build the ultimate AI text-to-video tool: KlipMaker.
Stay tuned for the next article, where we will take a deeper dive comparing ElevenLabs to Amazon Polly.