A language model is a probability distribution over sequences of words. Given any sequence of words of length ''m'', a language model assigns a probability P(w_1,\ldots,w_m) to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences (the property of digital infinity), language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data. Several modelling approaches have been designed to surmount this problem, such as applying the Markov assumption or using neural architectures such as recurrent neural networks or transformers.

Language models are useful for a variety of problems in computational linguistics: from initial applications in speech recognition to ensure nonsensical (i.e. low-probability) word sequences are not predicted, to wider use in machine translation (e.g. scoring candidate translations), natural language generation (generating more human-like text), part-of-speech tagging, parsing (Andreas, Jacob, Andreas Vlachos, and Stephen Clark. "Semantic parsing as machine translation." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2013.), optical character recognition, handwriting recognition, grammar induction, information retrieval, and other applications.

Language models are used in information retrieval in the query likelihood model. There, a separate language model is associated with each document in a collection. Documents are ranked based on the probability of the query ''Q'' in the document's language model M_d: P(Q\mid M_d). Commonly, the unigram language model is used for this purpose.


Model types


Unigram

A unigram model can be treated as the combination of several one-state finite automata. It assumes that the probabilities of tokens in a sequence are independent, e.g.:

: P_\text{uni}(t_1 t_2 t_3) = P(t_1) P(t_2) P(t_3).

In this model, the probability of each word depends only on that word's own probability in the document, so we only have one-state finite automata as units. The automaton itself has a probability distribution over the entire vocabulary of the model, summing to 1:

: \sum_{\text{term} \in V} P(\text{term}) = 1.

The probability generated for a specific query is calculated as

: P(\text{query}) = \prod_{\text{term} \in \text{query}} P(\text{term}).

Each document has its own unigram model, with its own hit probabilities for the words it contains. The probability distributions from different documents are used to generate hit probabilities for each query, and documents can be ranked for a query according to these probabilities. In information retrieval contexts, unigram language models are often smoothed to avoid instances where ''P''(term) = 0. A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to smooth the model.
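As an illustrative sketch of the query likelihood model with linearly interpolated (Jelinek–Mercer) smoothing, the following Python fragment ranks documents by P(Q\mid M_d); the document texts and the smoothing weight lam are made up for the example.

```python
from collections import Counter

# Toy document collection (made-up examples for illustration).
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}

# Maximum-likelihood unigram counts per document and for the whole collection.
doc_counts = {d: Counter(words) for d, words in docs.items()}
doc_lens = {d: len(words) for d, words in docs.items()}
coll_counts = Counter(w for words in docs.values() for w in words)
coll_len = sum(doc_lens.values())

def query_likelihood(query, d, lam=0.5):
    """P(Q | M_d), where P(t | M_d) is the linear interpolation
    lam * P_ml(t | d) + (1 - lam) * P_ml(t | collection)."""
    score = 1.0
    for t in query.split():
        p_doc = doc_counts[d][t] / doc_lens[d]
        p_coll = coll_counts[t] / coll_len
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

query = "cat mat"
ranking = sorted(docs, key=lambda d: query_likelihood(query, d), reverse=True)
print(ranking)  # documents ranked by P(Q | M_d); here d1 ranks first
```

The interpolation with the collection model is what keeps unseen query terms from zeroing out an otherwise relevant document's score.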


n-gram

In an ''n''-gram model, the probability P(w_1,\ldots,w_m) of observing the sentence w_1,\ldots,w_m is approximated as

: P(w_1,\ldots,w_m) = \prod_{i=1}^m P(w_i\mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^m P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}).

It is assumed that the probability of observing the ''i''th word w_i in the context history of the preceding ''i'' − 1 words can be approximated by the probability of observing it in the shortened context history of the preceding ''n'' − 1 words (''n''th-order Markov property). To clarify, for ''n'' = 3 and ''i'' = 2 we have P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = P(w_2\mid w_1).

The conditional probability can be calculated from ''n''-gram model frequency counts:

: P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}.

The terms bigram and trigram language models denote ''n''-gram models with ''n'' = 2 and ''n'' = 3, respectively. Typically, the ''n''-gram model probabilities are not derived directly from frequency counts, because models derived this way have severe problems when confronted with any ''n''-grams that have not been explicitly seen before. Instead, some form of smoothing is necessary, assigning some of the total probability mass to unseen words or ''n''-grams. Various methods are used, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
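As a minimal sketch of estimating bigram probabilities from frequency counts, with add-one (Laplace) smoothing for unseen bigrams, the following uses a tiny invented corpus:

```python
from collections import Counter

# Toy corpus, padded with start/end markers (invented for illustration).
corpus = [["<s>", "i", "saw", "the", "red", "house", "</s>"],
          ["<s>", "i", "saw", "the", "dog", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def p_mle(w, prev):
    """Maximum-likelihood estimate: count(prev, w) / count(prev). Zero for unseen bigrams."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_laplace(w, prev):
    """Add-one smoothing: every possible bigram gets a pseudo-count of 1."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

print(p_mle("saw", "i"))         # 1.0: every occurrence of "i" is followed by "saw"
print(p_mle("house", "dog"))     # 0.0: bigram never seen
print(p_laplace("house", "dog")) # small but non-zero probability
```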


Bidirectional

Bidirectional representations condition on both pre- and post-context (e.g., words) in all layers.
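For illustration, a bidirectional model such as BERT can predict a masked token from both its left and right context; the sketch below assumes the Hugging Face transformers library (and the bert-base-uncased checkpoint) is available.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# Fill-mask prediction uses context on both sides of the [MASK] token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("I saw the red [MASK] on the hill."):
    print(candidate["token_str"], round(candidate["score"], 3))
```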


Example

In a bigram (''n'' = 2) language model, the probability of the sentence ''I saw the red house'' is approximated as

: P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle)\, P(\text{saw}\mid\text{I})\, P(\text{the}\mid\text{saw})\, P(\text{red}\mid\text{the})\, P(\text{house}\mid\text{red})\, P(\langle /s\rangle\mid\text{house})

whereas in a trigram (''n'' = 3) language model, the approximation is

: P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle,\langle s\rangle)\, P(\text{saw}\mid\langle s\rangle,\text{I})\, P(\text{the}\mid\text{I},\text{saw})\, P(\text{red}\mid\text{saw},\text{the})\, P(\text{house}\mid\text{the},\text{red})\, P(\langle /s\rangle\mid\text{red},\text{house}).

Note that the context of the first ''n'' – 1 ''n''-grams is filled with start-of-sentence markers, typically denoted \langle s\rangle. Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence ''*I saw the'' would always be higher than that of the longer sentence ''I saw the red house''.
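A brief sketch of computing the bigram factorization above: the probability values here are invented placeholders standing in for smoothed estimates such as those in the earlier example.

```python
import math

# Hypothetical smoothed bigram probabilities P(w | prev), values made up for illustration.
p = {("<s>", "I"): 0.2, ("I", "saw"): 0.5, ("saw", "the"): 0.4,
     ("the", "red"): 0.1, ("red", "house"): 0.3, ("house", "</s>"): 0.6}

sentence = ["<s>", "I", "saw", "the", "red", "house", "</s>"]

# Chain the bigram factors; summing logs avoids numerical underflow on long sentences.
log_prob = sum(math.log(p[(prev, w)]) for prev, w in zip(sentence, sentence[1:]))
print(math.exp(log_prob))  # P(I saw the red house) under this bigram model
```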


Exponential

Maximum entropy language models encode the relationship between a word and the ''n''-gram history using feature functions. The equation is

: P(w_m \mid w_1,\ldots,w_{m-1}) = \frac{1}{Z(w_1,\ldots,w_{m-1})} \exp\bigl(a^T f(w_1,\ldots,w_m)\bigr)

where Z(w_1,\ldots,w_{m-1}) is the partition function, a is the parameter vector, and f(w_1,\ldots,w_m) is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain ''n''-gram. It is helpful to use a prior on a or some form of regularization. The log-bilinear model is another example of an exponential language model.
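A toy sketch of such a log-linear model follows; the indicator features, weights, and vocabulary are all made up, and NumPy is assumed. Normalizing the exponentiated scores over the vocabulary plays the role of the partition function Z.

```python
import numpy as np

vocab = ["the", "red", "house", "saw"]

def features(history, w):
    """Indicator features over the candidate word and the last history word (made up)."""
    last = history[-1]
    return np.array([
        1.0 if (last, w) == ("the", "red") else 0.0,
        1.0 if (last, w) == ("red", "house") else 0.0,
        1.0 if w == "the" else 0.0,          # a simple unigram-style feature
    ])

a = np.array([2.0, 1.5, 0.5])                # parameter vector, arbitrary values

def p_next(history):
    """P(w | history) = exp(a . f(history, w)) / Z(history)."""
    scores = np.array([a @ features(history, w) for w in vocab])
    exp_scores = np.exp(scores)
    return dict(zip(vocab, exp_scores / exp_scores.sum()))  # divide by partition function Z

print(p_next(["I", "saw", "the"]))  # "red" receives the highest probability
```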


Neural network

Neural language models (or ''continuous space language models'') use continuous representations or embeddings of words to make their predictions. These models make use of neural networks.

Continuous space embeddings help to alleviate the curse of dimensionality in language modeling: as language models are trained on larger and larger texts, the number of unique words (the vocabulary) increases. The number of possible sequences of words increases exponentially with the size of the vocabulary, causing a data sparsity problem because of the exponentially many sequences. Thus, statistics are needed to properly estimate probabilities. Neural networks avoid this problem by representing words in a distributed way, as non-linear combinations of weights in a neural net. An alternate description is that a neural net approximates the language function. The neural net architecture might be feed-forward or recurrent, and while the former is simpler, the latter is more common.

Typically, neural net language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution

: P(w_t \mid \mathrm{context}) \, \forall t \in V.

I.e., the network is trained to predict a probability distribution over the vocabulary, given some linguistic context. This is done using standard neural net training algorithms such as stochastic gradient descent with backpropagation. The context might be a fixed-size window of previous words, so that the network predicts

: P(w_t \mid w_{t-k}, \dots, w_{t-1})

from a feature vector representing the previous ''k'' words. Another option is to use "future" words as well as "past" words as features, so that the estimated probability is

: P(w_t \mid w_{t-k}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+k}).

This is called a bag-of-words model. When the feature vectors for the words in the context are combined by a continuous operation, this model is referred to as the continuous bag-of-words architecture (CBOW).

A third option that trains more slowly than the CBOW but performs slightly better is to invert the previous problem and make a neural network learn the context, given a word. More formally, given a sequence of training words w_1, w_2, w_3, \dots, w_T, one maximizes the average log-probability

: \frac{1}{T}\sum_{t=1}^T \sum_{-k \le j \le k,\, j \ne 0} \log P(w_{t+j} \mid w_t)

where ''k'', the size of the training context, can be a function of the center word w_t. This is called a skip-gram language model. Bag-of-words and skip-gram models are the basis of the word2vec program.

Instead of using neural net language models to produce actual probabilities, it is common to instead use the distributed representation encoded in the networks' "hidden" layers as representations of words; each word is then mapped onto an ''n''-dimensional real vector called the word embedding, where ''n'' is the size of the layer just before the output layer. The representations in skip-gram models have the distinct characteristic that they model semantic relations between words as linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its ''n''-d vector representation, then

: v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.
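A compact sketch of the fixed-window, feed-forward case described above — a probabilistic classifier predicting P(w_t \mid w_{t-k}, \dots, w_{t-1}) and trained with stochastic gradient descent and backpropagation — is given below; it assumes PyTorch, and the class name WindowLM, the hyperparameters, and the random toy batch are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (arbitrary choices).
vocab_size, embed_dim, window, hidden_dim = 1000, 32, 3, 64

class WindowLM(nn.Module):
    """Predicts P(w_t | w_{t-window}, ..., w_{t-1}) from concatenated word embeddings."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embeddings
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)       # scores over the vocabulary

    def forward(self, context):                            # context: (batch, window) word ids
        e = self.embed(context).flatten(start_dim=1)       # (batch, window * embed_dim)
        return self.out(torch.tanh(self.hidden(e)))        # unnormalized logits

model = WindowLM()
loss_fn = nn.CrossEntropyLoss()                            # softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # stochastic gradient descent

# One training step on a random toy batch (real data would supply actual word ids).
context = torch.randint(0, vocab_size, (8, window))
target = torch.randint(0, vocab_size, (8,))
optimizer.zero_grad()
loss = loss_fn(model(context), target)
loss.backward()                                            # backpropagation
optimizer.step()

# Predicted distribution over the next word for the first example in the batch.
probs = torch.softmax(model(context[:1]), dim=-1)
print(probs.shape)  # torch.Size([1, 1000]) -- a distribution over the vocabulary
```

The rows of the learned embedding matrix are exactly the word embeddings discussed above.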


Other

A positional language model assesses the probability of given words occurring close to one another in a text, not necessarily immediately adjacent. Similarly, bag-of-concepts models leverage the semantics associated with multi-word expressions such as ''buy_christmas_present'', even when they are used in information-rich sentences like "today I bought a lot of very nice Christmas presents". Despite the limited successes in using neural networks, authors acknowledge the need for other techniques when modelling sign languages.


Notable language models

Notable language models include:

* Pathways Language Model (PaLM), a 540-billion-parameter model from Google Research
* Generalist Language Model (GLaM), a 1-trillion-parameter model from Google Research
* Language Models for Dialog Applications (LaMDA), a 137-billion-parameter model from Google Research
* Megatron-Turing NLG, a 530-billion-parameter model from Microsoft/Nvidia
* DreamFusion/Imagen, 3D image generation from Google Research
* Get3D from Nvidia
* MineClip from Nvidia
* BLOOM: BigScience Large Open-science Open-access Multilingual Language Model, with 176 billion parameters
* GPT-2: Generative Pre-trained Transformer 2, with 1.5 billion parameters
* GPT-3: Generative Pre-trained Transformer 3, with the unprecedented size of a 2048-token-long context and 175 billion parameters (requiring 800 GB of storage)
* GPT-3.5/ChatGPT/InstructGPT from OpenAI
* BERT: Bidirectional Encoder Representations from Transformers
* GPT-NeoX-20B: an open-source autoregressive language model with 20 billion parameters
* OPT-175B by Meta AI: another 175-billion-parameter language model, available to the broader AI research community
* Point-E by OpenAI: a 3D model generator

Hugging Face hosts a set of publicly available language models for developers to build applications using machine learning.


Evaluation and benchmarks

Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks built from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed models investigate the rate of learning, e.g. through inspection of learning curves.

Various data sets have been developed for evaluating language processing systems. These include:

* Corpus of Linguistic Acceptability
* GLUE benchmark
* Microsoft Research Paraphrase Corpus
* Multi-Genre Natural Language Inference
* Question Natural Language Inference
* Quora Question Pairs
* Recognizing Textual Entailment
* Semantic Textual Similarity Benchmark
* SQuAD question answering test
* Stanford Sentiment Treebank
* Winograd NLI


Criticism

Although contemporary language models, such as GPT-2, can be shown to match human performance on some tasks, it is not clear they are plausible cognitive models. For instance, recurrent neural networks have been shown to learn patterns humans do not learn and fail to learn patterns that humans do learn.


See also

* Statistical model
* Factored language model
* Cache language model
* Katz's back-off model
* Transformer
* BERT

