Whisper is a machine learning model for

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the ma ...

and

transcription Transcription refers to the process of converting sounds (voice, music etc.) into letters or musical notes, or producing a copy of something in another medium, including: Genetics * Transcription (biology), the copying of DNA into RNA, the fir ...

, created by

OpenAI OpenAI is an artificial intelligence (AI) research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The company conducts research in the field of AI with the stated goal of promo ...

and first released as

open-source software Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose. Ope ...

in September 2022. It is capable of transcribing speech in English and several other languages, and is also capable of translating several non-English languages into English. OpenAI claims that the combination of different training data used in its development has led to improved recognition of accents, background noise and jargon compared to previous approaches. Whisper is a weakly-supervised deep learning acoustic model, made using an encoder-decoder transformer architecture. Whisper Large V2 was released on December 8, 2022. Whisper Large V3 was released in November 2023, on the OpenAI Dev Day.

Background

Speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the ma ...

has had a long history in research; the first approaches made use of statistical methods, such as

dynamic time warping In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walki ...

, and later

hidden Markov models A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...

. At around the 2010s,

deep neural network Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. ...

approaches became more common for speech recognition models, which were enabled by the availability of large datasets (" big data") and increased computational performance. Early approaches to deep learning in speech recognition included

convolutional neural networks In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networ ...

, which were limited due to their inability to capture sequential data, which later led to developments of

Seq2seq Seq2seq is a family of machine learning approaches used for natural language processing. Applications include language translation, image captioning, conversational models and text summarization. History The algorithm was proposed by Tomáš Mik ...

approaches, which include

recurrent neural networks A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This allows it to exhibit temporal dynamic ...

which made use of

long short-term memory Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network (RNN) ca ...

. Transformers, introduced in 2017 by

Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...

, displaced many prior state-of-the art approaches to many problems in

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

, and started becoming the core neural architecture in fields such as

language modeling A language model is a probability distribution over sequences of words. Given any sequence of words of length , a language model assigns a probability P(w_1,\ldots,w_m) to the whole sequence. Language models generate probabilities by training on ...

and

computer vision Computer vision is an Interdisciplinarity, interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate t ...

; weakly-supervised approaches to training acoustic models were recognized in the early 2020s as promising for speech recognition approaches using deep neural networks. According to a

NYT ''The New York Times'' (''the Times'', ''NYT'', or the Gray Lady) is a daily newspaper based in New York City with a worldwide readership reported in 2020 to comprise a declining 840,000 paid print subscribers, and a growing 6 million paid ...

report, in 2021 OpenAI believed they exhausted sources of higher-quality data to train their

large language models A large language model (LLM) is a language model consisting of a neural network with many parameters (typically billions of weights or more), trained on large quantities of unlabelled text using self-supervised learning. LLMs emerged around 2018 an ...

and decided to complement scraped web text with transcriptions of YouTube videos and podcasts, and developed Whisper to solve this task.

Architecture

The Whisper architecture is based on an encoder-decoder

transformer A transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple circuits. A varying current in any coil of the transformer produces a varying magnetic flux in the transformer' ...

. Input audio is resampled to 16,000 Hz and converting to an 80-channel log-magnitude Mel spectrogram using 25 ms windows with a 10 ms stride. The spectrogram is then normalized to a

1, 1 Onekama ( ) is a village in Manistee County in the U.S. state of Michigan. The population was 411 at the 2010 census. The village is located on the shores of Portage Lake and is surrounded by Onekama Township. The town's name is derived from "O ...

range with near-zero mean. The encoder takes this Mel spectrogram as input and processes it. It first passes through two convolutional layers. Sinusoidal positional embeddings are added. It is then processed by a series of Transformer encoder blocks (with pre-activation residual connections). The encoder's output is layer normalized. The decoder is a standard Transformer decoder. It has the same width and Transformer blocks as the encoder. It uses learned positional embeddings and tied input-output token representations (using the same weight matrix for both the input and output embeddings). It uses a byte-pair encoding tokenizer, of the same kind as used in

GPT-2 Generative Pre-trained Transformer 2 (GPT-2) is an open-source artificial intelligence created by OpenAI in February 2019. GPT-2 translates text, answers questions, summarizes passages, and generates text output on a level that, while sometim ...

. English-only models use the GPT-2 vocabulary, while multilingual models employ a re-trained multilingual vocabulary with the same number of words. Special tokens are used to allow the decoder to perform multiple tasks: * Tokens that denote language (one unique token per language). * Tokens that specify task (<, transcribe, > or <, translate, >). * Tokens that specify if no timestamps are present (<, notimestamps, >). If the token is not present, then the decoder predicts timestamps relative to the segment, and quantized to 20 ms intervals. * <, nospeech, > for voice activity detection. * <, startoftranscript, >, and <, endoftranscript, > . Any text that appears before <, startoftranscript, > is not generated by the decoder, but given to the decoder as context. Loss is only computed over non-contextual parts of the sequence, i.e. tokens between these two special tokens.

Data

The training dataset consists of 680,000 hours of labeled audio-transcript pairs sourced from the internet. This includes 117,000 hours in 96 non-English languages and 125,000 hours of X→English translation data, where X stands for any non-English language. Preprocessing involved standardization of transcripts, filtering to remove machine-generated transcripts using heuristics (e.g.,

punctuation Punctuation (or sometimes interpunction) is the use of spacing, conventional signs (called punctuation marks), and certain typographical devices as aids to the understanding and correct reading of written text, whether read silently or aloud. A ...

capitalization Capitalization (American English) or capitalisation (British English) is writing a word with its first letter as a capital letter (uppercase letter) and the remaining letters in lower case, in writing systems with a case distinction. The term ...

language identification In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, sol ...

and matching with transcripts,

fuzzy Fuzzy or Fuzzies may refer to: Music * Fuzzy (band), a 1990s Boston indie pop band * Fuzzy (composer) (born 1939), Danish composer Jens Vilhelm Pedersen * ''Fuzzy'' (album), 1993 debut album by the Los Angeles rock group Grant Lee Buffalo * "Fu ...

deduplication The term deduplication refers generally to eliminating duplicate or redundant information. *Data deduplication, in computer storage, refers to the elimination of redundant data *Record linkage Record linkage (also known as data matching, data l ...

, and deduplication with evaluation datasets to avoid data contamination. Speechless segments were also included, to allow

voice activity detection Voice activity detection (VAD), also known as speech activity detection or speech detection, is the detection of the presence or absence of human speech, used in speech processing. The main uses of VAD are in speech coding and speech recognition. I ...

training. For the files still remaining after the filtering process, audio files were then broken into 30-second segments paired with the subset of the transcript that occurs within that time. If this predicted spoken language differed from the language of the text transcript associated with the audio, that audio-transcript pair was not used for training the speech recognition models, but instead for training translation.

Post-training filtering

After training the first model, they ran the trained model on was calculated on different subsets of the training data (each representing a distinct source). Data sources were ranked by a combination of their error rate and size. Manual inspection of the top-ranked sources (high error, large size) helped determine if the source was low quality (e.g., partial transcriptions, inaccurate alignment). Low-quality sources were then removed.

Training

Whisper has been trained using

semi-supervised learning Weak supervision is a branch of machine learning where noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amounts of training data in a supervised learning setting. This approach alleviates the burden of ...

on 680,000 hours of multilingual and multitask data, of which about one-fifth (117,000 hours) were non-English audio data. After training, it was fine-tuned to suppress the prediction of speaker names. It was trained by AdamW optimizer with gradient norm clipping and a linear learning rate decay with warmup, with batch size 256 segments. Training proceeds for 1 million updates (2-3 epochs). No data augmentation or regularization, except for the Large V2 model, which used SpecAugment, Stochastic Depth, and BPE Dropout. Training used

data parallelism Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures lik ...

with

float16 In computing, half precision (sometimes called FP16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in application ...

, dynamic loss scaling, and activation checkpointing.

Capacity

Whisper does not outperform models which specialize in the LibriSpeech

dataset A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the d ...

, although when tested across many datasets, it is more

robust Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...

and makes 50% fewer errors than other models. Whisper has a differing error rate with respect to transcribing different languages, with a higher

word error rate Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from ...

in languages not well-represented in the training data. The authors found that

multi-task learning Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction ac ...

improved overall performance compared to models specialized to one task. They conjectured that the best Whisper model trained is still underfitting the dataset, and larger models and longer training can result in better models. Third-party evaluations have found varying levels of AI hallucination. A study of transcripts of public meetings found hallucinations in eight out of every 10 transcripts, while an engineer discovered hallucinations in "about half" of 100 hours of transcriptions and a developer identified them in "nearly every one" of 26,000 transcripts. A study of 13,140 short audio segments (averaging 10 seconds) found 187 hallucinations (1.4%), 38% of which generated text that could be harmful because it inserted false references to things like race, non-existent medications, or violent events that were not in the audio.

Applications

The model has been used as the base for many applications, such as a unified model for speech recognition and more general

sound recognition Sound recognition is a technology, which is based on both traditional pattern recognition theories and audio signal analysis methods. Sound recognition technologies contain preliminary data processing, feature extraction and classification algo ...

References

{{Artificial intelligence navbox Speech recognition software OpenAI Applications of artificial intelligence 2022 software