A word ''n''-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network–based models, which have in turn been superseded by large language models. It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if ''n'' − 1 words, an ''n''-gram model.
Special tokens, commonly written ⟨s⟩ and ⟨/s⟩, are introduced to denote the start and end of a sentence.
To prevent a zero probability being assigned to unseen words, some probability mass is redistributed from observed ''n''-grams to unseen ones, so each observed word's probability is slightly lower than its relative frequency in the corpus. Various methods have been used for this, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
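As an illustration, here is a minimal sketch (in Python; the function and token names are illustrative, with ⟨s⟩ and ⟨/s⟩ rendered as plain strings) of padding a sentence with boundary tokens and splitting it into ''n''-grams:

```python
def ngrams(tokens, n, sos="<s>", eos="</s>"):
    """Pad a token list with boundary markers and yield its n-grams."""
    padded = [sos] * (n - 1) + tokens + [eos]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i : i + n])

# Trigrams of a short sentence:
print(list(ngrams(["I", "saw", "the", "red", "house"], 3)))
# First trigram: ('<s>', '<s>', 'I'); last: ('red', 'house', '</s>')
```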
Unigram model
A special case, where ''n'' = 1, is called a unigram model. The probability of each word in a sequence is independent of the probabilities of the other words in the sequence. Each word's probability in the sequence is equal to the word's probability in the document as a whole.
The model consists of units, each treated as a one-state finite automaton. Words with their probabilities in a document can be illustrated as follows.
The total mass of word probabilities distributed across the document's vocabulary is 1.
The probability generated for a specific query is calculated as
: <math>P(\text{query}) = \prod_{w \in \text{query}} P(w)</math>
Unigram models of different documents assign different probabilities to the words in them, so the probability distributions of different documents can be used to generate hit probabilities for each query, and documents can be ranked for a query according to these probabilities.
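For example, a minimal sketch of unigram models of two short documents and their ranking for a query (in Python; maximum-likelihood estimates without smoothing, whitespace tokenization, and toy documents chosen for illustration):

```python
from collections import Counter

def unigram_model(text):
    """Map each word to its relative frequency in the document."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_probability(model, query):
    """Product of independent per-word probabilities (0.0 if a word is unseen)."""
    p = 1.0
    for w in query.split():
        p *= model.get(w, 0.0)
    return p

doc1 = unigram_model("a a b c")
doc2 = unigram_model("a b b d")
print(query_probability(doc1, "a a"))  # (2/4) * (2/4) = 0.25
print(query_probability(doc2, "a a"))  # (1/4) * (1/4) = 0.0625 -- doc1 ranks higher
```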
Bigram model
In a bigram (''n'' = 2) language model, the probability of the sentence ''I saw the red house'' is approximated as
: <math>P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle)\, P(\text{saw}\mid \text{I})\, P(\text{the}\mid \text{saw})\, P(\text{red}\mid \text{the})\, P(\text{house}\mid \text{red})\, P(\langle/s\rangle\mid \text{house})</math>
Trigram model
In a trigram (''n'' = 3) language model, the approximation is
: <math>P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid \langle s\rangle,\langle s\rangle)\, P(\text{saw}\mid \langle s\rangle,\text{I})\, P(\text{the}\mid \text{I},\text{saw})\, P(\text{red}\mid \text{saw},\text{the})\, P(\text{house}\mid \text{the},\text{red})\, P(\langle/s\rangle\mid \text{red},\text{house})</math>
Note that the context of the first ''n'' − 1 ''n''-grams is filled with start-of-sentence markers, typically denoted ⟨s⟩.
Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence ''*I saw the'' would always be higher than that of the longer sentence ''I saw the red house.''
Approximation method
The approximation method calculates the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence <math>w_1,\ldots,w_m</math>:
: <math>P(w_1,\ldots,w_m) = \prod_{i=1}^{m} P(w_i\mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^{m} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})</math>
It is assumed that the probability of observing the ''i''th word ''w''<sub>''i''</sub> (in the context window consisting of the preceding ''i'' − 1 words) can be approximated by the probability of observing it in the shortened context window consisting of the preceding ''n'' − 1 words (''n''th-order Markov property). To clarify, for ''n'' = 3 and ''i'' = 2 we have <math>P(w_2\mid w_1)</math>.
The conditional probability can be calculated from ''n''-gram model frequency counts:
: <math>P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\operatorname{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\operatorname{count}(w_{i-(n-1)},\ldots,w_{i-1})}</math>
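The count-ratio formula above can be computed directly from a corpus; a minimal sketch, assuming a pre-tokenized list of words and no smoothing:

```python
from collections import Counter

def conditional_prob(tokens, context, word):
    """Maximum-likelihood estimate P(word | context) from raw counts."""
    n = len(context) + 1
    ngram_counts = Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
    context_counts = Counter(tuple(tokens[i:i + n - 1])
                             for i in range(len(tokens) - n + 2))
    return ngram_counts[tuple(context) + (word,)] / context_counts[tuple(context)]

corpus = "the cat sat on the mat the cat ran".split()
# "the" occurs 3 times, "the cat" occurs twice:
print(conditional_prob(corpus, ("the",), "cat"))  # 0.666...
```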
Out-of-vocabulary words
An issue when using ''n''-gram language models is out-of-vocabulary (OOV) words. They are encountered in computational linguistics and natural language processing when the input includes words which were not present in a system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, it may be necessary to estimate the language model with a specific fixed vocabulary. In such a scenario, the ''n''-grams in the corpus that contain an out-of-vocabulary word are ignored. The ''n''-gram probabilities are smoothed over all the words in the vocabulary even if they were not observed.
Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g. ⟨unk⟩) into the vocabulary. Out-of-vocabulary words in the corpus are effectively replaced with this special token before ''n''-gram counts are accumulated. With this option, it is possible to estimate the transition probabilities of ''n''-grams involving out-of-vocabulary words.
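A sketch of this preprocessing step, assuming a fixed vocabulary; the token name ⟨unk⟩ is the common convention, but any reserved symbol works:

```python
UNK = "<unk>"  # reserved symbol for out-of-vocabulary words

def replace_oov(tokens, vocabulary):
    """Map every token outside the fixed vocabulary to the <unk> symbol."""
    return [t if t in vocabulary else UNK for t in tokens]

vocab = {"the", "cat", "sat"}
print(replace_oov("the dog sat".split(), vocab))
# ['the', '<unk>', 'sat'] -- n-gram counts then include <unk> transitions
```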
''n''-grams for approximate matching
''n''-grams were also used for approximate matching. If we convert strings (with only letters in the English alphabet) into character 3-grams, we get a <math>26^3</math>-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. However, we know empirically that if two strings of real text have a similar vector representation (as measured by cosine distance) then they are likely to be similar. Other metrics have also been applied to vectors of ''n''-grams with varying, sometimes better, results. For example, z-scores have been used to compare documents by examining how many standard deviations each ''n''-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which form the "background" vector). In the event of small counts, the g-score (also known as g-test) gave better results.
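A sketch of the 3-gram vector construction and cosine comparison (in Python, using sparse dictionaries of counts rather than an explicit <math>26^3</math>-dimensional array; the strings are illustrative):

```python
import math
from collections import Counter

def char_trigrams(s):
    """Sparse count vector over the character 3-grams of a string."""
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine_similarity(a, b):
    """Dot product divided by the product of Euclidean norms."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity(char_trigrams("approximate"),
                        char_trigrams("approximately")))  # ~0.90
```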
It is also possible to take a more principled approach to the statistics of ''n''-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference.
''n''-gram-based searching was also used for plagiarism detection.
Bias–variance tradeoff
To choose a value for ''n'' in an ''n''-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that a trigram model (i.e. triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
Smoothing techniques
There is a problem of balancing the weight between ''infrequent grams'' (for example, if a proper name appeared in the training data) and ''frequent grams''. Also, items not seen in the training data will be given a probability of 0.0 without smoothing. For unseen but plausible data from a sample, one can introduce pseudocounts. Pseudocounts are generally motivated on Bayesian grounds.
In practice it was necessary to ''smooth'' the probability distributions by also assigning non-zero probabilities to unseen words or ''n''-grams. The reason is that models derived directly from the ''n''-gram frequency counts have severe problems when confronted with any ''n''-grams that have not explicitly been seen before – the zero-frequency problem. Various smoothing methods were used, from simple "add-one" (Laplace) smoothing (assign a count of 1 to unseen ''n''-grams; see Rule of succession) to more sophisticated models, such as Good–Turing discounting or back-off models. Some of these methods are equivalent to assigning a prior distribution to the probabilities of the ''n''-grams and using Bayesian inference to compute the resulting posterior ''n''-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations. Methods include:
* Linear interpolation (e.g., taking the weighted mean of the unigram, bigram, and trigram probabilities; a sketch follows this list)
* Good–Turing discounting
* Witten–Bell discounting
* Lidstone's smoothing
* Katz's back-off model (trigram)
* Kneser–Ney smoothing
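As a sketch of the first method above, linear interpolation with fixed weights (the weights and probability values are placeholders; in practice the weights are tuned on held-out data):

```python
def interpolated_prob(trigram_p, bigram_p, unigram_p, weights=(0.6, 0.3, 0.1)):
    """Weighted mean of trigram, bigram and unigram estimates.

    The weights must sum to 1 so the result remains a probability.
    """
    l3, l2, l1 = weights
    return l3 * trigram_p + l2 * bigram_p + l1 * unigram_p

# An unseen trigram (estimate 0.0) still receives non-zero probability
# from the lower-order models:
print(interpolated_prob(0.0, 0.05, 0.001))  # 0.0151
```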
Skip-gram language model
The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e. the word ''n''-gram language model) faced. Words represented in an embedding vector are no longer necessarily consecutive, but can leave gaps that are ''skipped'' over (thus the name "skip-gram").
Formally, a ''k''-skip-''n''-gram is a length-''n'' subsequence where the components occur at distance at most ''k'' from each other.
For example, in the input text:
:''the rain in Spain falls mainly on the plain''
the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences
:''the in'', ''rain Spain'', ''in falls'', ''Spain mainly'', ''falls on'', ''mainly the'', and ''on plain''.
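A brute-force sketch of this definition (in Python; it treats ''k'' as the total number of skipped positions in the subsequence, one common formalization):

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """Length-n subsequences with at most k skipped positions in total."""
    result = []
    for idx in combinations(range(len(tokens)), n):
        skips = (idx[-1] - idx[0]) - (n - 1)  # positions skipped overall
        if skips <= k:
            result.append(tuple(tokens[i] for i in idx))
    return result

text = "the rain in Spain falls mainly on the plain".split()
# Ordinary bigrams plus the skipped pairs listed above:
print(skip_grams(text, 2, 1))
```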
In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if ''v'' is the function that maps a word ''w'' to its ''d''-dimensional vector representation, then
: <math>v(\text{king}) - v(\text{male}) + v(\text{female}) \approx v(\text{queen})</math>
where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.
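A sketch of that nearest-neighbor reading, with tiny hand-made 3-dimensional vectors standing in for learned embeddings (real models use hundreds of dimensions learned from text):

```python
import math

# Toy vectors; a trained skip-gram model would learn these from data.
v = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "male":   [0.1, 0.9, 0.0],
    "female": [0.1, 0.0, 0.9],
}

def nearest(target, vocab):
    """Word whose vector is closest (Euclidean distance) to the target."""
    return min(vocab, key=lambda w: math.dist(vocab[w], target))

target = [k - m + f for k, m, f in zip(v["king"], v["male"], v["female"])]
print(nearest(target, v))  # 'queen'
```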
Syntactic ''n''-grams
Syntactic ''n''-grams are ''n''-grams defined by paths in syntactic dependency or constituent trees rather than the linear structure of the text. For example, the sentence "economic news has little effect on financial markets" can be transformed to syntactic ''n''-grams following the tree structure of its dependency relations: news-economic, effect-little, effect-on-markets-financial.
Syntactic ''n''-grams are intended to reflect syntactic structure more faithfully than linear ''n''-grams, and have many of the same applications, especially as features in a vector space model. For certain tasks, syntactic ''n''-grams give better results than standard ''n''-grams, for example in authorship attribution.
Another type of syntactic ''n''-grams are part-of-speech ''n''-grams, defined as fixed-length contiguous overlapping subsequences that are extracted from part-of-speech sequences of text. Part-of-speech ''n''-grams have several applications, most commonly in information retrieval.
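Extracting part-of-speech ''n''-grams reduces to ordinary ''n''-gram extraction over the tag sequence; a minimal sketch, assuming the tags have already been produced by a tagger (the tags below are hand-assigned for the example):

```python
def pos_ngrams(tags, n):
    """Contiguous overlapping n-grams of a part-of-speech tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Tags for "economic news has little effect":
tags = ["ADJ", "NOUN", "VERB", "ADJ", "NOUN"]
print(pos_ngrams(tags, 3))
# [('ADJ', 'NOUN', 'VERB'), ('NOUN', 'VERB', 'ADJ'), ('VERB', 'ADJ', 'NOUN')]
```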
Other applications
''n''-grams find use in several areas of computer science, computational linguistics, and applied mathematics.
They have been used to:
* design kernels that allow machine learning algorithms such as support vector machines to learn from string data
* find likely candidates for the correct spelling of a misspelled word (U.S. Patent 6618697, "Method for rule-based correction of spelling and grammar errors")
* improve compression in compression algorithms where a small area of data requires ''n''-grams of greater length
* assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, speech recognition, OCR (optical character recognition), Intelligent Character Recognition (ICR), machine translation and similar applications
* improve retrieval in information retrieval systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents
* improve retrieval performance in genetic sequence analysis as in the BLAST family of programs
* identify the language a text is in or the species a small sequence of DNA was taken from
* predict letters or words at random in order to create text, as in the dissociated press algorithm.
* cryptanalysis
See also
* Collocation
* Feature engineering
* Hidden Markov model
* Longest common substring
* MinHash
* n-tuple
* String kernel
References
* Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". ''Speech and Language Processing'' (3rd edition draft). Retrieved 24 May 2022. https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf