WaveNet is a deep

neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...

for generating raw audio. It was created by researchers at London-based AI firm

DeepMind DeepMind Technologies Limited, trading as Google DeepMind or simply DeepMind, is a British–American artificial intelligence research laboratory which serves as a subsidiary of Alphabet Inc. Founded in the UK in 2010, it was acquired by Go ...

. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a

method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing

text-to-speech Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or Computer hardware, hardware products. A text-to-speech (TTS) system conv ...

(TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

History

Generating speech from text is an increasingly common task thanks to the popularity of software such as Apple's

Siri Siri ( , backronym: Speech Interpretation and Recognition Interface) is a digital assistant purchased, developed, and popularized by Apple Inc., which is included in the iOS, iPadOS, watchOS, macOS, Apple TV, audioOS, and visionOS operating sys ...

, Microsoft's Cortana,

Amazon Alexa Amazon Alexa is a virtual assistant technology marketed by Amazon and implemented in software applications for smart phones, tablets, wireless smart speakers, and other electronic appliances. Alexa was largely developed from a Polish speech s ...

and the

Google Assistant Google Assistant is a virtual assistant software application developed by Google that is primarily available on home automation and mobile devices. Based on artificial intelligence, Google Assistant can engage in two-way conversations, unlike ...

. Most such systems use a variation of a technique that involves concatenated sound fragments together to form recognisable sounds and words. The most common of these is called concatenative TTS. It consists of large library of speech fragments, recorded from a single speaker that are then concatenated to produce complete words and sounds. The result sounds unnatural, with an odd cadence and tone. The reliance on a recorded library also makes it difficult to modify or change the voice. Another technique, known as parametric TTS, uses mathematical models to recreate sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech is typically created using a voice synthesiser known as a

vocoder A vocoder (, a portmanteau of ''vo''ice and en''coder'') is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation. The vocoder wa ...

. This can also result in unnatural sounding audio.

Design and ongoing research

Background

WaveNet is a type of

feedforward neural network Feedforward refers to recognition-inference architecture of neural networks. Artificial neural network architectures are based on inputs multiplied by weights to obtain outputs (inputs-to-output): feedforward. Recurrent neural networks, or neur ...

known as a deep

convolutional neural network A convolutional neural network (CNN) is a type of feedforward neural network that learns features via filter (or kernel) optimization. This type of deep learning network has been applied to process and make predictions from many different ty ...

(CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time. It does so by sampling from a

softmax The softmax function, also known as softargmax or normalized exponential function, converts a tuple of real numbers into a probability distribution of possible outcomes. It is a generalization of the logistic function to multiple dimensions, a ...

(i.e. categorical) distribution of a signal value that is encoded using μ-law companding transformation and quantized to 256 possible values.

Initial concept and results

According to the original September 2016 DeepMind research paper ''WaveNet: A Generative Model for Raw Audio'', the network was fed real waveforms of speech in English and Mandarin. As these pass through the network, it learns a set of rules to describe how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms at 16,000 samples per second. These waveforms include realistic breaths and lip smacks – but do not conform to any language. WaveNet is able to accurately model different voices, with the accent and tone of the input correlating with the output. For example, if it is trained with German, it produces German speech. The capability also means that if the WaveNet is fed other inputs – such as music – its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce waveforms that sound like

classical music Classical music generally refers to the art music of the Western world, considered to be #Relationship to other music traditions, distinct from Western folk music or popular music traditions. It is sometimes distinguished as Western classical mu ...

Content (voice) swapping

According to the June 2018 paper ''Disentangled Sequential

Autoencoder An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function ...

'', DeepMind has successfully used WaveNet for audio and voice "content swapping": the network can swap the voice on an audio recording for another, pre-existing voice while maintaining the text and other features from the original recording. "We also experiment on audio sequence data. Our disentangled representation allows us to convert speaker identities into each other while conditioning on the content of the speech." (p. 5) "For audio, this allows us to convert a male speaker into a female speaker and vice versa '' ..'." (p. 1) According to the paper, a two-digit minimum amount of hours (c. 50 hours) of pre-existing speech recordings of both source and target voice are required to be fed into WaveNet for the program to learn their individual features before it is able to perform the conversion from one voice to another at a satisfying quality. The authors stress that "'' 'n advantage of the model is that it separates dynamical from static features '' ..'." (p. 8), i. e. WaveNet is capable of distinguishing between the spoken text and modes of delivery (modulation, speed, pitch, mood, etc.) to maintain during the conversion from one voice to another on the one hand, and the basic features of both source and target voices that it is required to swap on the other. The January 2019 follow-up paper ''Unsupervised speech representation learning using WaveNet autoencoders'' details a method to successfully enhance the proper automatic recognition and discrimination between dynamical and static features for "content swapping", notably including swapping voices on existing audio recordings, in order to make it more reliable. Another follow-up paper, ''Sample Efficient Adaptive Text-to-Speech'', dated September 2018 (latest revision January 2019), states that DeepMind has successfully reduced the minimum amount of real-life recordings required to sample an existing voice via WaveNet to "merely a few minutes of audio data" while maintaining high-quality results. Its ability to clone voices has raised ethical concerns about WaveNet's ability to mimic the voices of living and dead persons. According to a 2016

BBC The British Broadcasting Corporation (BBC) is a British public service broadcaster headquartered at Broadcasting House in London, England. Originally established in 1922 as the British Broadcasting Company, it evolved into its current sta ...

article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to insert watermarking inaudible to humans to prevent counterfeiting, while maintaining that voice cloning satisfying, for instance, the needs of entertainment-industry purposes would be of a far lower complexity and use different methods than required to fool forensic evidencing methods and electronic ID devices, so that natural voices and voices cloned for entertainment-industry purposes could still be easily told apart by technological analysis.

Applications

At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real world applications. As of October 2017, Google announced a 1,000-fold performance improvement along with better voice quality. WaveNet was then used to generate

voices for US English and Japanese across all Google platforms. In November 2017, DeepMind researchers released a research paper detailing a proposed method of "generating high-fidelity speech samples at more than 20 times faster than real-time", called "Probability Density Distillation". At the annual I/O developer conference in May 2018, it was announced that new Google Assistant voices were available and made possible by WaveNet; WaveNet greatly reduced the number of audio recordings that were required to create a voice model by modeling the raw audio of the voice actor samples.

References

External links

WaveNet: A Generative Model for Raw Audio
{{Artificial intelligence navbox Speech synthesis Artificial neural networks Google acquisitions Google