The history of natural language processing describes the advances of natural language processing. There is some overlap with the history of machine translation, the history of speech recognition, and the history of artificial intelligence.
Early history
The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine.
The first patents for "translating machines" were applied for in the mid-1930s. One proposal, by Georges Artsrouni, was simply an automatic bilingual dictionary using paper tape. The other proposal, by Peter Troyanskii, a Russian, was more detailed. Troyanskii's proposal included both the bilingual dictionary and a method for dealing with grammatical roles between languages, based on Esperanto.
Logical period
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human.
In 1957, Noam Chomsky's ''Syntactic Structures'' revolutionized linguistics with 'universal grammar', a rule-based system of syntactic structures.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.
A notably successful NLP system developed in the 1960s was SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies.
In 1969 Roger Schank introduced the conceptual dependency theory for natural language understanding. This model, partially influenced by the work of Sydney Lamb, was extensively used by Schank's students at Yale University, such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner.
In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input. Instead of ''phrase structure rules'', ATNs used an equivalent set of finite-state automata that were called recursively. ATNs and their more general format, called "generalized ATNs", continued to be used for a number of years.
During the 1970s, many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981).
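The idea of finite-state networks that call one another recursively, on which ATNs were built, can be sketched with a plain recursive transition network. The toy grammar, lexicon, and function names below are invented for illustration; an actual ATN additionally carries registers and tests on its arcs.

```python
# A minimal recursive transition network (RTN): each named network is a set of
# arcs, and an arc may either consume a word of a given lexical category or
# call another network recursively.

LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "N", "ball": "N",
    "saw": "V", "chased": "V",
}

# Each network maps a state to a list of (label, kind, next_state) arcs;
# kind is "word" (match a lexical category), "net" (call another network),
# or "jump" (move on without consuming input).
NETWORKS = {
    "S":  {0: [("NP", "net", 1)], 1: [("VP", "net", 2)], 2: []},
    "NP": {0: [("DET", "word", 1)], 1: [("N", "word", 2)], 2: []},
    "VP": {0: [("V", "word", 1)], 1: [("NP", "net", 2), (None, "jump", 2)], 2: []},
}

def traverse(net, state, words, pos):
    """Return the input positions reachable after traversing `net` from `state`."""
    results = []
    if not NETWORKS[net][state]:          # a state with no outgoing arcs is final
        results.append(pos)
    for label, kind, nxt in NETWORKS[net][state]:
        if kind == "jump":
            results += traverse(net, nxt, words, pos)
        elif kind == "word" and pos < len(words) and LEXICON.get(words[pos]) == label:
            results += traverse(net, nxt, words, pos + 1)
        elif kind == "net":               # recursive call into a sub-network
            for end in traverse(label, 0, words, pos):
                results += traverse(net, nxt, words, end)
    return results

def accepts(sentence):
    words = sentence.lower().split()
    return len(words) in traverse("S", 0, words, 0)

print(accepts("the dog chased a ball"))  # True
print(accepts("dog the chased"))         # False
```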
During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.
In recent years, advancements in deep learning and large language models have significantly enhanced the capabilities of natural language processing, leading to widespread applications in areas such as healthcare, customer service, and content generation.
Statistical period
Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's law and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.
Chomskyan linguistics encourages the investigation of "corner cases" that stress the limits of its theoretical models (comparable to pathological phenomena in mathematics), typically created using thought experiments, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in corpus linguistics. The creation and use of such corpora of real-world data is a fundamental part of machine-learning algorithms for NLP. In addition, theoretical underpinnings of Chomskyan linguistics such as the so-called "poverty of the stimulus" argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing. As a result, the Chomskyan paradigm discouraged the application of such models to language processing.
Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
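As a rough illustration of such soft, weighted decisions, the sketch below scores part-of-speech tags by summing hand-picked real-valued feature weights and normalising the scores into probabilities. The features, weights, and tag set are invented for the example and are not taken from any particular system.

```python
import math

# Invented feature weights: each (feature, tag) pair carries a real-valued weight.
WEIGHTS = {
    ("ends_with_ly", "ADVERB"): 2.0,
    ("ends_with_ly", "NOUN"): -1.0,
    ("previous_is_the", "NOUN"): 1.5,
    ("previous_is_the", "ADVERB"): -0.5,
}
TAGS = ["NOUN", "ADVERB"]

def features(word, prev):
    """Extract simple binary features of the input word and its left context."""
    feats = []
    if word.endswith("ly"):
        feats.append("ends_with_ly")
    if prev == "the":
        feats.append("previous_is_the")
    return feats

def tag_probabilities(word, prev):
    """Soft decision: sum feature weights per tag, then softmax into probabilities."""
    scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in features(word, prev)) for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

print(tag_probabilities("quickly", "ran"))   # leans strongly towards ADVERB
print(tag_probabilities("assembly", "the"))  # conflicting evidence yields a softer split
```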
Datasets
The emergence of statistical approaches was aided both by the increase in computing power and by the availability of large datasets. At that time, large multilingual corpora were starting to emerge. Notably, some were produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government.
Many of the notable early successes occurred in the field of machine translation. In 1993, the IBM alignment models were used for statistical machine translation. Compared to previous machine translation systems, which were symbolic systems manually coded by computational linguists, these systems were statistical, which allowed them to learn automatically from large textual corpora. However, such systems do not work well when only small corpora are available, so data-efficient methods continue to be an area of research and development.
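These systems were commonly framed as a noisy-channel problem: to translate a foreign sentence f, choose the target-language sentence e that is most probable given f, which Bayes' rule factors into a language model P(e) and a translation model P(f | e) whose parameters, such as word-alignment probabilities, are estimated from bilingual corpora. A standard form of this decision rule, stated generically rather than as any one system's exact objective, is:

    \hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(e)\,P(f \mid e)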
In 2001, a text corpus of one billion words, scraped from the Internet and referred to as "very very large" at the time, was used for word-sense disambiguation.
To take advantage of large, unlabelled datasets, algorithms were developed for unsupervised and self-supervised learning. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
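A minimal sketch of how self-supervised learning manufactures a supervisory signal from raw, unlabelled text is given below; the corpus, context-window size, and function name are illustrative assumptions rather than a description of any specific system.

```python
def next_word_pairs(text, context_size=3):
    """Turn raw text into (context, next-word) training examples.

    Every position in the corpus yields a labelled example, so no human
    annotation is needed: the data supplies its own supervision.
    """
    tokens = text.lower().split()
    examples = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        target = tokens[i]
        examples.append((context, target))
    return examples

corpus = "the dog chased the ball and the dog caught the ball"
for context, target in next_word_pairs(corpus):
    print(context, "->", target)
```

Any predictive model trained on pairs like these is learning from the text alone, which is what allows the enormous stock of non-annotated data mentioned above to be exploited.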
Neural period

In 1990, the Elman network, using a recurrent neural network, encoded each word in a training set as a vector, called a word embedding, and the whole vocabulary as a vector database, allowing it to perform such tasks as sequence prediction that are beyond the power of a simple multilayer perceptron. A shortcoming of these static embeddings was that they did not differentiate between the multiple meanings of homonyms.
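A minimal sketch of both ideas, static embedding lookup and an Elman-style recurrent update, is given below; the vocabulary, dimensions, and random (untrained) weights are invented for illustration.

```python
import numpy as np

# (1) A static word embedding is just a row in a lookup table, so a homonym such
#     as "bank" gets one vector regardless of context.
# (2) An Elman-style recurrent step mixes the current word vector with a hidden
#     state carried over from earlier words, which is what enables sequence
#     prediction over whole prefixes.

rng = np.random.default_rng(0)
vocab = ["the", "bank", "river", "money", "flows"]
word_to_id = {w: i for i, w in enumerate(vocab)}

embedding_dim, hidden_dim = 4, 6
embeddings = rng.normal(size=(len(vocab), embedding_dim))   # one static vector per word
W_x = rng.normal(size=(hidden_dim, embedding_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))

def encode(sentence):
    """Run an Elman-style recurrence over the sentence, returning the final hidden state."""
    h = np.zeros(hidden_dim)
    for word in sentence.split():
        x = embeddings[word_to_id[word]]        # static embedding lookup
        h = np.tanh(W_x @ x + W_h @ h)          # hidden state depends on the whole prefix
    return h

# The same "bank" vector is used in both sentences (the homonym problem), but the
# hidden states differ because the surrounding words differ.
print(encode("the river bank flows"))
print(encode("the money bank flows"))
print(embeddings[word_to_id["bank"]])           # one vector, both senses
```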