Using WaveNet to generate human voice
As part of my independent study at ACCAD at OSU, I'm exploring where machine learning meets art. In this experiment, I use Google DeepMind's WaveNet architecture as implemented by GitHub user ibab. Using his TensorFlow implementation and audio from the VCTK Corpus, I've trained the network to capture the "essence" of the human voice.
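The heart of WaveNet is a stack of dilated causal convolutions, which let the network look far back in the audio signal without any sample "seeing" the future. Here's a minimal NumPy sketch of that idea; the function name, filter sizes, and dilation schedule are illustrative, not taken from ibab's implementation:

```python
import numpy as np

def causal_conv1d(x, w, dilation):
    """1-D causal convolution: the output at time t depends only on
    x[t], x[t - dilation], x[t - 2*dilation], ... (never on future samples)."""
    kernel_size = len(w)
    # Left-pad so the output has the same length and stays causal.
    pad = dilation * (kernel_size - 1)
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(kernel_size):
            out[t] += w[k] * xp[t + pad - k * dilation]
    return out

# Stacking layers with dilations 1, 2, 4, 8, ... doubles the receptive
# field at each layer, so a modest stack covers a long audio context.
x = np.random.randn(32)
w = np.array([0.5, 0.5])  # toy 2-tap filter
y = x
for d in [1, 2, 4, 8]:
    y = causal_conv1d(y, w, d)
```

The real network wraps each layer in gated activations and residual connections, but the causal, exponentially dilated structure above is what makes modeling raw audio tractable.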
Since the corpus contains samples from many different dialects and accents, the trained network becomes fairly robust. The implementation also supports global conditioning, so when generating voice I can choose which speaker to condition on. For instance, if I add a few samples of my own voice to the training data, I can have the network generate audio in the style of my voice.
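Conceptually, global conditioning just means a per-speaker embedding gets added to the network's activations at every timestep, biasing generation toward that speaker. A toy sketch, with all names and shapes being my own illustrations rather than ibab's actual code:

```python
import numpy as np

num_speakers, embed_dim, timesteps = 4, 8, 16
rng = np.random.default_rng(0)

# One learned embedding vector per speaker (random here for illustration).
speaker_table = rng.standard_normal((num_speakers, embed_dim))

def condition(activations, speaker_id):
    """Broadcast a single speaker embedding across all timesteps
    of a layer's activations."""
    h = speaker_table[speaker_id]           # shape (embed_dim,)
    return activations + h[np.newaxis, :]   # shape (timesteps, embed_dim)

acts = rng.standard_normal((timesteps, embed_dim))
out = condition(acts, speaker_id=2)
```

Because the same vector is applied at every timestep, the conditioning is "global": it shifts the overall character of the output (the speaker's voice) rather than the content at any particular moment.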
As training continues, I'll add more examples of the output as it moves from white noise to babbling. Pay attention to the step number in the title of each clip; the more training steps have elapsed, the more the network has (hopefully) "learned". Step 0 is white noise, step 5000 sounds like it's starting to understand what makes a voice sound like a voice, and step 40000 is a bit better, but still full of non-human-sounding changes in pitch and speed. Networks trained much longer capture things we normally filter out, like the sound of a mouth opening or breathing between words. Feel free to look up other examples!
This neural network is what Google improved on to create WaveRNN, the speech synthesizer used inside Android phones today! It can generate speech at 4x realtime, meaning 4 seconds of audio can be generated in 1 second on a phone's processor; Google Assistant takes advantage of this to generate speech while offline. That speedup is several orders of magnitude beyond this WaveNet implementation - for the samples above, generating 1 second of audio took upwards of 15 minutes. It shows just how active and exciting AI research currently is - we've moved from something impractical to speedy and efficient in under 2 years. Click here to view the source code!
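To put a rough number on that gap, here's the back-of-envelope arithmetic using the figures from this post (15 minutes per second of audio for my WaveNet runs, versus 4x realtime for WaveRNN):

```python
# Seconds of compute needed to produce 1 second of audio.
wavenet_cost = 15 * 60   # ~15 minutes per second, from my runs above
wavernn_cost = 1 / 4     # 4x realtime: 0.25 s of compute per second

speedup = wavenet_cost / wavernn_cost
print(speedup)  # 3600.0, i.e. roughly three and a half orders of magnitude
```

And that's comparing a datacenter-class GPU run against a phone processor, which makes the gap even more striking.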