A language model is a

model A model is an informative representation of an object, person, or system. The term originally denoted the plans of a building in late 16th-century English, and derived via French and Italian ultimately from Latin , . Models can be divided in ...

of the human brain's ability to produce

natural language A natural language or ordinary language is a language that occurs naturally in a human community by a process of use, repetition, and change. It can take different forms, typically either a spoken language or a sign language. Natural languages ...

. Language models are useful for a variety of tasks, including

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also ...

machine translation Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages. Early approaches were mostly rule-based or statisti ...

,Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013)
"Semantic parsing as machine translation"
. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

natural language generation Natural language generation (NLG) is a software process that produces natural language output. A widely cited survey of NLG methods describes NLG as "the subfield of artificial intelligence and computational linguistics that is concerned with the ...

(generating more human-like text),

optical character recognition Optical character recognition or optical character reader (OCR) is the electronics, electronic or machine, mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo ...

, route optimization,

handwriting recognition Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwriting, handwritten input from sources such as paper documents, photographs, touch-screens ...

grammar induction Grammar induction (or grammatical inference) is the process in machine learning of learning a formal grammar (usually as a collection of ''re-write rules'' or '' productions'' or alternatively as a finite-state machine or automaton of some kind) ...

, and

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

Large language model A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are g ...

s (LLMs), currently their most advanced form, are predominantly based on

transformers ''Transformers'' is a media franchise produced by American toy company Hasbro and Japanese toy company Tomy, Takara Tomy. It primarily follows the heroic Autobots and the villainous Decepticons, two Extraterrestrials in fiction, alien robot fac ...

trained on larger datasets (frequently using words scraped from the public

internet The Internet (or internet) is the Global network, global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a internetworking, network of networks ...

). They have superseded

recurrent neural network Recurrent neural networks (RNNs) are a class of artificial neural networks designed for processing sequential data, such as text, speech, and time series, where the order of elements is important. Unlike feedforward neural networks, which proces ...

-based models, which had previously superseded the purely statistical models, such as word ''n''-gram language model.

History

Noam Chomsky Avram Noam Chomsky (born December 7, 1928) is an American professor and public intellectual known for his work in linguistics, political activism, and social criticism. Sometimes called "the father of modern linguistics", Chomsky is also a ...

did pioneering work on language models in the 1950s by developing a theory of

formal grammar A formal grammar is a set of Terminal and nonterminal symbols, symbols and the Production (computer science), production rules for rewriting some of them into every possible string of a formal language over an Alphabet (formal languages), alphabe ...

s. In 1980, statistical approaches were explored and found to be more useful for many purposes than rule-based formal grammars. Discrete representations like word ''n''-gram language models, with probabilities for discrete combinations of words, made significant advances. In the 2000s, continuous representations for words, such as word embeddings, began to replace discrete representations. Typically, the representation is a

real-valued In mathematics, value may refer to several, strongly related notions. In general, a mathematical value may be any definite mathematical object. In elementary mathematics, this is most often a number – for example, a real number such as or an ...

vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning, and common relationships between pairs of words like plurality or gender.

Pure statistical models

In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘ Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

Models based on word ''n''-grams

Exponential

Maximum entropy language models encode the relationship between a word and the ''n''-gram history using feature functions. The equation is

P(w_m \mid w_1,\ldots,w_) = \frac \exp (a^T f(w_1,\ldots,w_m))

where

Z(w_1,\ldots,w_)

is the partition function,

a

is the parameter vector, and

f(w_1,\ldots,w_m)

is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain ''n''-gram. It is helpful to use a prior on

a

or some form of

regularization Regularization may refer to: * Regularization (linguistics) * Regularization (mathematics) * Regularization (physics) * Regularization (solid modeling) * Regularization Law, an Israeli law intended to retroactively legalize settlements See also ...

. The log-bilinear model is another example of an exponential language model.

Skip-gram model

Neural models

Recurrent neural network

Continuous representations or embeddings of words are produced in

-based language models (known also as ''continuous space language models''). Such continuous space embeddings help to alleviate the

curse of dimensionality The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. T ...

, which is the consequence of the number of possible sequences of words increasing exponentially with the size of the vocabulary, further causing a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.

Large language models

Although sometimes matching human performance, it is not clear whether they are plausible

cognitive model A cognitive model is a representation of one or more cognitive processes in humans or other animals for the purposes of comprehension and prediction. There are many types of cognitive models, and they can range from box-and-arrow diagrams to a se ...

s. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.

Evaluation and benchmarks

Evaluation of the quality of language models is mostly done by comparison to human created sample benchmarks created from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves. Various data sets have been developed for use in evaluating language processing systems. These include: *

Massive Multitask Language Understanding Measuring Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several other versions and spin-offs, such as MMLU-Pro, MMMLU and MMLU-Redux. Overview MMLU con ...

(MMLU) * Corpus of Linguistic Acceptability * GLUE benchmark * Microsoft Research Paraphrase Corpus * Multi-Genre Natural Language Inference * Question Natural Language Inference * Quora Question Pairs * Recognizing Textual Entailment * Semantic Textual Similarity Benchmark * SQuAD question answering Test * Stanford Sentiment

Treebank In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empi ...

* Winograd NLI * BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs