I tested 3 text-to-speech AI models to see which is best - hear my results

I experimented with three leading text-to-speech AI models - here's what I found — Elyse Betters Picaro / ZDNET

ZDNET’s key takeaways

There are now several AI tools available that can generate humanlike speech.
Some AI voices can now whisper, laugh, and perform other expressive feats.
TTS tools vary in terms of their level of realism and their intended audiences.

Synthetic voices generated by artificial intelligence are, for better or worse, becoming commonplace. Meanwhile, the number of companies developing this technology is growing rapidly.

Recent innovations in AI, such as the transformer architecture that forms the backbone of large language models, along with generative adversarial networks (GANs) and diffusion models, have led to the rise of AI systems that can convert text prompts into natural-sounding artificial speech. There are now a wide variety of these text-to-speech (TTS) systems available, each with its particular benefits and shortcomings.

To gain a clearer sense of which are the most advanced, I tested three of the most popular free TTS tools currently on the market.

ElevenLabs

ElevenLabs is widely considered an industry leader in voice realism, and I found this to be a reasonably accurate assessment in my own experiments with the company’s TTS tool. But that realism feels more closely aligned with the voice of a trained voice actor or professional podcaster than it does with ordinary human conversation; it’s almost a little too polished. That polish, however, is why it tends to be the preferred choice for many businesses and professionals looking for reliable automated narration. It also supports more than 20 languages, further expanding the platform’s reach and appeal.

The company also released a new text-to-speech model called v3 as a research preview last month. It supports more than 70 languages, and users can spice up their AI-generated dialogue with audio tags that cause it to laugh, sigh, or speak in a whisper, to name just a few examples.

You can sign up for a free account with ElevenLabs, and you’ll automatically receive 10,000 free credits. Select the “Text to Speech” option under “Playground” in the left-hand menu, and you’ll be redirected to a page where you can enter a custom prompt you’d like the AI system to narrate, select from a range of custom voices, and adjust parameters like speed and stability. Prompts are limited to 5,000 characters, and every character in each iteration of a voice generation uses a single credit.
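The credit math above is easy to sketch in a few lines of Python. This is only an illustration built on the numbers quoted in this article (one credit per character, a 5,000-character prompt limit, 10,000 free credits); the function and constant names are made up for the example and are not part of any ElevenLabs API.

```python
# Figures taken from the article; treat them as assumptions that may change.
CREDITS_PER_CHAR = 1
MAX_PROMPT_CHARS = 5_000
FREE_CREDITS = 10_000


def credits_needed(text: str, generations: int = 1) -> int:
    """Credits used when `text` is generated `generations` times."""
    return len(text) * CREDITS_PER_CHAR * generations


def split_prompt(text: str, limit: int = MAX_PROMPT_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Breaks on a space where one is available, otherwise cuts at the limit.
    """
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space that keeps the chunk within the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

By this math, a 1,200-character prompt regenerated three times would use 3,600 of the 10,000 free credits, so it pays to settle on your wording before you hit generate.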

Hume AI

Hume AI’s TTS model is another contender for the most realistic voice-generating tool. The company has positioned its proprietary Empathic Voice Interface (EVI) as an AI system that can capture and simulate the subtleties of human speech, imbuing it with a deeper layer of believability. Like ElevenLabs, Hume offers a broad set of premade AI voice characters, each with its own expressive quirks. You can also generate custom voices by describing them in natural-language prompts.

To test it out, I did my best to describe the voice of Samwise Gamgee from “The Lord of the Rings,” as portrayed in the films by Sean Astin. My prompt: “Gentle but brave hobbit, with a working-class, West Country British — possibly with a hint of Welsh — accent. He should sound frightened but resolved to complete his mission.”

After I prompted it to say a famous line from the film, “If I take one more step, it’ll be the furthest away from home I’ve ever been,” it produced three samples, varying in tone and emphasis. All of them were impressive; to my ear, they contained a degree of realism and emotional depth that Hume’s competitors can’t match. They didn’t sound much like Astin’s Sam, but that was undoubtedly a reflection of the admittedly imperfect description I used as a prompt.

You can also insert pauses by adding “[pause]” to your prompt, or sprinkle in slang like “y’all” to enhance the believability of your custom voices.
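If you are preparing a longer script, you could add those pause markers automatically rather than by hand. The sketch below is a rough illustration only: it assumes the “[pause]” tag quoted above is typed directly into the prompt text, and the helper name is hypothetical, not something Hume provides.

```python
import re

PAUSE_TAG = "[pause]"  # tag syntax as quoted in the article


def add_pauses(text: str) -> str:
    """Insert a pause tag after each sentence that is followed by more text."""
    # Match sentence-ending punctuation followed by whitespace, then put
    # the tag between the two sentences. The final sentence gets no tag.
    return re.sub(r"([.!?])\s+", rf"\1 {PAUSE_TAG} ", text.strip())
```

Pausing after every sentence may sound stilted, so treat the output as a starting point and delete the tags you don’t want.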

Descript

If you’re looking for an AI voice-generating tool that offers a range of editing features, Descript is the one to choose.

The company’s TTS model generates audio files in a waveform format, which you can edit just as you would in Adobe Audition or a similar platform. You can choose from a library of premade AI voices or submit a short recording of your own voice, and the system will clone it for you.

I tested the voice-cloning feature by asking the system to read a short prompt: “Summers in New York City are getting brutal, and I need to invest in more high-quality air conditioning.” (Which is true.) The first time around, the AI-generated version of my voice definitely sounded like me, but there was also a mechanical quality that detracted from the realism.

I decided to give it another try and re-record my voice, this time taking off my Bluetooth headphones and reading the script more slowly and deliberately. The results this time were much more realistic — a more convincing simulation of my voice, in my opinion, than a similar voice-cloning feature offered by Hume.

The clone wasn’t perfect, of course; my close friends and family members would probably be able to spot the difference, but it would likely fool my more distant acquaintances. You can also adjust each piece of AI-generated audio by directly editing your written prompt. I can easily imagine using the tool to narrate my own articles or for some similar use case.

For podcasters and other content creators looking to quickly polish their audio recordings, Descript also offers an AI feature that identifies and eliminates filler words, unnecessary pauses, “umms” and “uhhs,” and other unwanted bits of audio.
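To get a feel for what filler-word removal involves, here is a toy version that works on a plain-text transcript. It is not Descript’s method, which operates on recorded audio; the pattern and function name are assumptions made up for this illustration, and the filler list covers only the “umms” and “uhhs” mentioned above.

```python
import re

# Hypothetical filler pattern: "um", "umm", "uh", "uhh" and so on, plus an
# optional trailing comma and any whitespace that follows.
FILLER_PATTERN = re.compile(r"\b(?:u+m+|u+h+)\b,?\s*", re.IGNORECASE)


def remove_fillers(transcript: str) -> str:
    """Strip standalone filler words from a transcript string."""
    cleaned = FILLER_PATTERN.sub("", transcript)
    # Collapse any doubled spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

The word boundaries in the pattern keep ordinary words such as “umbrella” intact, which is the main thing a naive find-and-replace would get wrong.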

ZDNET’s advice

It’s important to bear in mind that these are just three of a huge number of TTS models currently available, and that each user will have their own preferences based on their professional role, tech savviness, budget, and so on. Before you choose a platform and run with it, spend a few minutes playing with different options to see which user interfaces feel most intuitive and which ones offer features that align most closely with your creative goals. Also remember that services vary in how they use your data.

Regardless of which platform you end up using, keep your eye on the speed at which this technology continues to evolve. Very soon, we’ll likely be living in a world filled with AI voices — and some of them could sound just like your own.
