Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

I think I fell asleep just reading that title. And yet, this new piece of research work direct from Google shows us some amazing new AI capabilities.

It also makes for an entertaining new game!

Meet Tacotron 2. A second-iteration neural network built by Google machine learning engineers to synthesise ordinary written text into natural spoken words. This new program takes regular old sentences such as “This is your personal assistant Google Home.” and turns them into speech like the sample below.

As you can imagine, this has huge benefits across the board, from aiding the blind, to giving AIs a human voice, and even allowing feedback via smart home speakers. Google has long worked on Text-to-Speech (TTS) and has recently seriously stepped up its game in making it more natural and human-sounding.

Google Home

New Kid On The Block

Despite Google’s recent upgrade in making synthesised speech sound more natural, I think this new version takes it pretty much to its conclusion. An indistinguishable machine voice.

Go ahead, try it out for yourself. Below are a few sentences that have both been spoken by a real person and also generated by the Tacotron 2 neural network. Listen to both and see if you can tell which audio file belongs to which. The game of “Tacotron 2 or Human?”

“That girl did a video about Star Wars lipstick.”

“She earned a doctorate in sociology at Columbia University.”

“George Washington was the first President of the United States.”

Would you like to know the answers?

I’d happily tell you… if I knew myself. And that’s the point. I can’t tell. According to their research, listeners rated the two almost identically.

Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.
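For the curious, a mean opinion score is simply the average of the ratings listeners give a clip, usually on a 1 to 5 scale. Here is a minimal sketch of that arithmetic in Python. The function name `mean_opinion_score` and the ratings are made up for illustration, and the 95% interval uses a plain normal approximation; none of this is taken from the paper's own evaluation code or data.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for listener ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    mos = mean(ratings)
    # Normal-approximation interval: 1.96 standard errors either side of the mean.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


if __name__ == "__main__":
    # Hypothetical ratings, NOT data from the paper.
    synthetic_ratings = [5, 4, 5, 4, 4, 5, 5, 4]
    human_ratings = [5, 5, 4, 5, 4, 5, 5, 4]
    for label, ratings in (("synthetic", synthetic_ratings), ("human", human_ratings)):
        mos, half_width = mean_opinion_score(ratings)
        print(f"{label}: MOS {mos:.2f} +/- {half_width:.2f}")
```

When the two intervals overlap heavily, as they would for scores as close as 4.53 and 4.58, the ratings alone give little basis for telling the voices apart.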

The Future

Next up I think I’d like to see more emotion and personality built into TTS. Kind of like an emotional equaliser.

Emotional EQ

Would you prefer a happy-go-lucky AI voice? A sexy lady? Or someone who speaks softly and kindly?

They could also enable it to evolve over time or even react to your moods. Maybe the next Amazon Echo will learn that, you know, in the morning… you don’t really want someone super cheerful talking to you. You want its voice to be soft, generally neutral and to the point. After all, you did just wake up! Then when you get home on a Friday night it adapts to being more upbeat, excited and happy, and even cracks jokes.

Whatever it ends up developing into, it’s great to finally see TTS rise to the level of natural human speech. It’s been a long road from the original “robot voice” you’d hear generated back in the 1980s. A hearty congratulations to the entire team who contributed to this achievement!

If you’d like to hear some more samples or even read the full paper (which is available for free), head on over here. Also, Merry Christmas!




