Generative audio refers to the creation of audio files from databases of audio clips. This technology differs from synthesized voices such as Apple's Siri or Amazon's Alexa, which use a collection of fragments that are stitched together on demand. Generative audio works by using neural networks to learn the statistical properties of an audio source and then reproducing those properties.
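The idea of learning the statistical properties of a source and then reproducing them can be illustrated with a much simpler model than a neural network. The sketch below is an assumption-laden toy, not how any production system works: it swaps in a linear autoregressive model fitted by least squares, and the names `fit_ar` and `generate` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def fit_ar(signal, order):
    """Estimate coefficients so that x[t] ~ sum_k coeffs[k] * x[t - 1 - k]."""
    signal = np.asarray(signal, dtype=float)
    # Column k holds the sample k + 1 steps before each target sample.
    X = np.column_stack(
        [signal[order - k - 1 : len(signal) - k - 1] for k in range(order)]
    )
    y = signal[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def generate(coeffs, seed, n_samples):
    """Continue `seed` by repeatedly predicting the next sample."""
    order = len(coeffs)
    out = list(seed[-order:])
    for _ in range(n_samples):
        recent = out[-1 : -order - 1 : -1]  # most recent sample first
        out.append(float(np.dot(coeffs, recent)))
    return np.array(out[order:])

# Toy "audio source": a sine wave (an assumption made for the example).
t = np.arange(200)
source = np.sin(0.3 * t)
coeffs = fit_ar(source[:100], order=2)
continuation = generate(coeffs, source[:100], n_samples=50)
```

A neural model replaces the linear predictor with a far more expressive function, but the loop is the same in spirit: estimate how each sample depends on the preceding ones, then sample forward from that estimate.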

Implications

With this technology, a person's voice can be replicated to speak phrases they may never have spoken. This could lead to a synthetic version of a public figure's voice being used against them.

Technology

Modern generative audio systems employ various deep learning architectures. One notable approach uses generative adversarial networks (GANs), where two machine learning models work against each other to create realistic audio. Other architectures include WaveNet, which uses dilated causal convolutions to model raw audio waveforms, and implementations like 15.ai, which demonstrated in 2020 the ability to clone voices using as little as 15 seconds of training data through specialized neural network architectures.
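To make the phrase "dilated causal convolutions" concrete, here is a minimal NumPy sketch. It is not WaveNet itself (there are no learned weights, gating, or residual connections), and the function names, the filter values, and the dilation schedule 1, 2, 4, 8 are assumptions chosen for illustration. "Causal" means each output sample depends only on current and past inputs; "dilated" means the filter taps are spaced `dilation` samples apart, so stacked layers cover a long history with few parameters.

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with x = 0 before t = 0."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift >= len(x):
            break  # this tap would only ever see the zero padding
        if shift == 0:
            y += w * x
        else:
            y[shift:] += w * x[:-shift]  # look `shift` samples into the past
    return y

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)

def stack(x, weights, dilations):
    """Apply one causal dilated layer per entry in `dilations`."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x

# Feed in a single impulse and observe how far its influence spreads.
impulse = np.zeros(20)
impulse[0] = 1.0
response = stack(impulse, weights=[1.0, 1.0], dilations=[1, 2, 4, 8])
span = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

With a kernel of size 2, four layers with doubling dilations reach back 16 samples, whereas four undilated layers would reach back only 5. This exponential growth is what makes modeling raw waveforms at audio sample rates tractable.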
