At Last, Google’s DeepMind AI Can Make Machines Sound Like Humans

Google has announced WaveNet, a speech synthesis program that uses AI and deep learning techniques to generate speech samples better than current technologies.

By modeling audio at 16,000 samples per second, it can generate human-like speech and even its own music compositions.


If you’ve ever been lost in the maze of YouTube videos, you may have stumbled on clips of computers reading news articles.

You’d recognize the staccato, robotic nature of the voice. We’ve come a long way from “Danger, Will Robinson!” but there has yet to be a computer that can seamlessly mimic a human voice.

Now, there’s a new contender, brought to you by the brilliant minds at DeepMind. Google has announced WaveNet, a new voice synthesis program powered by a deep neural network.

Understanding voice samples has been powering programs like Google Voice Search for quite some time now.

However, synthesizing something from those samples is proving to be quite a challenge.

The most prominent method to do that right now is concatenative TTS (text-to-speech), which stitches together fragments of recorded speech.

The major drawback is that this method can’t modify the fragments to create something new, which results in the stilted, “robotic” voice. Another method is parametric TTS, which generates audio by passing a model’s output through a vocoder and tends to sound even less natural.
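To make the limitation of concatenative TTS concrete, here is a minimal sketch in Python. The fragment names and amplitude values are invented for illustration; a real system draws on a large database of recorded speech. The point is only that the stored fragments are joined as they are and never altered.

```python
# Hypothetical fragment inventory: each name maps to a few pre-recorded
# amplitude values. These names and numbers are made up for illustration.
UNITS = {
    "he": [0.1, 0.3, 0.2],
    "llo": [0.4, 0.1, -0.2],
    "wor": [-0.1, 0.2, 0.5],
    "ld": [0.3, 0.0, -0.3],
}


def concatenate(unit_names):
    """Join stored fragments end to end without modifying them."""
    audio = []
    for name in unit_names:
        audio.extend(UNITS[name])
    return audio


# Build "hello world" from the four stored fragments.
waveform = concatenate(["he", "llo", "wor", "ld"])
```

Because the output can only ever be a rearrangement of what was recorded, any change in emphasis or emotion would require new recordings.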


Google’s WaveNet uses a completely different approach. Instead of simply analyzing the audio it’s fed, it learns from it, similar to how many deep neural systems work.

By working with at least 16,000 samples per second, WaveNet can generate its own raw audio samples.

And it can do this without much human intervention; it uses statistics to predict which audio sample it needs, that is, what it has to “say” next.
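The idea of predicting each audio sample from the ones before it can be sketched as a simple loop. This is a toy illustration, not DeepMind’s implementation: the function `toy_next_sample_distribution` is a hypothetical stand-in for the trained neural network, and the constants `LEVELS` and `CONTEXT` are assumptions chosen to keep the example small.

```python
import random

LEVELS = 8    # toy number of quantized amplitude levels (assumption)
CONTEXT = 4   # how many past samples the toy model looks at (assumption)


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for a trained network.

    Returns a probability for each amplitude level, favoring levels
    close to the most recent sample so the signal changes smoothly.
    """
    last = history[-1] if history else LEVELS // 2
    weights = [1.0 / (1 + abs(level - last)) for level in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Generate audio one sample at a time, each drawn from a
    distribution conditioned on the samples that came before."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(samples[-CONTEXT:])
        next_value = rng.choices(range(LEVELS), weights=probs, k=1)[0]
        samples.append(next_value)
    return samples


# One second of audio at 16,000 samples per second means 16,000 predictions.
one_second = generate(16000)
```

The loop shows why the sample rate matters: every second of output requires 16,000 separate predictions, each depending on the ones made before it.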

Want to take a listen for yourself? The announcement post has several voice samples in both English and Mandarin Chinese.

The system is also able to synthesize its own music, since it can analyze any sound pattern, not just speech.

You can also listen to samples of the original compositions. Perhaps most impressively, the system is also able to synthesize speech without input.

Where TTS always requires input as instruction, WaveNet is able to create speech sound without a road map.

Granted, the result is just a string of nonsense sounds, but it also contains the sounds of mouth movements and breathing.

This indicates the exciting potential of the system to create the most realistic computer voices yet.

References: DeepMind, Fortune

AUTHOR: Jelor Gallego

Date: September 9, 2016
EDITOR: Patrick Caughill

Reblogged from Futurism with thanks @
