Outline of natural language processing
   HOME

TheInfoList



OR:

The following
outline Outline or outlining may refer to: * Outline (list), a document summary, in hierarchical list format * Code folding, a method of hiding or collapsing code or text to see content in outline form * Outline drawing, a sketch depicting the outer edge ...
is provided as an overview of and topical guide to natural-language processing:
natural-language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
– computer activity in which computers are entailed to analyze, understand, alter, or generate
natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...
. This includes the
automation Automation describes a wide range of technologies that reduce human intervention in processes, namely by predetermining decision criteria, subprocess relationships, and related actions, as well as embodying those predeterminations in machines ...
of any or all linguistic forms, activities, or methods of communication, such as
conversation Conversation is interactive communication between two or more people. The development of conversational skills and etiquette is an important part of socialization. The development of conversational skills in a new language is a frequent focus ...
, correspondence,
reading Reading is the process of taking in the sense or meaning of Letter (alphabet), letters, symbols, etc., especially by Visual perception, sight or Somatosensory system, touch. For educators and researchers, reading is a multifaceted process invo ...
, written composition,
dictation Dictation can refer to: *Dictation (exercise), when one person speaks while another person transcribes *'' Dictation: A Quartet'', a collection of short stories by Cynthia Ozick, published in 2008 * Digital dictation, the use of digital electronic ...
,
publishing Publishing is the activity of making information, literature, music, software and other content available to the public for sale or for free. Traditionally, the term refers to the creation and distribution of printed works, such as books, newsp ...
,
translation Translation is the communication of the Meaning (linguistic), meaning of a #Source and target languages, source-language text by means of an Dynamic and formal equivalence, equivalent #Source and target languages, target-language text. The ...
,
lip reading The lips are the visible body part at the mouth of many animals, including humans. Lips are soft, movable, and serve as the opening for food intake and in the articulation of sound and speech. Human lips are a tactile sensory organ, and can be ...
, and so on. Natural-language processing is also the name of the branch of
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
,
artificial intelligence Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks in which this is done include speech re ...
, and
linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguis ...
concerned with enabling computers to engage in communication using natural language(s) in all forms, including but not limited to
speech Speech is a human vocal communication using language. Each language uses Phonetics, phonetic combinations of vowel and consonant sounds that form the sound of its words (that is, all English words sound different from all French words, even if ...
, print,
writing Writing is a medium of human communication which involves the representation of a language through a system of physically Epigraphy, inscribed, Printing press, mechanically transferred, or Word processor, digitally represented Symbols (semiot ...
, and signing.


Natural-language processing

Natural-language processing can be described as all of the following: * A field of
science Science is a systematic endeavor that builds and organizes knowledge in the form of testable explanations and predictions about the universe. Science may be as old as the human species, and some of the earliest archeological evidence for ...
– systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe. ** An
applied science Applied science is the use of the scientific method and knowledge obtained via conclusions from the method to attain practical goals. It includes a broad range of disciplines such as engineering and medicine. Applied science is often contrasted ...
– field that applies human knowledge to build or design useful things. *** A field of
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
– scientific and practical approach to computation and its applications. **** A branch of
artificial intelligence Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks in which this is done include speech re ...
– intelligence of machines and robots and the branch of computer science that aims to create it. **** A subfield of
computational linguistics Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...
– interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective. ** An application of
engineering Engineering is the use of scientific method, scientific principles to design and build machines, structures, and other items, including bridges, tunnels, roads, vehicles, and buildings. The discipline of engineering encompasses a broad rang ...
– science, skill, and profession of acquiring and applying scientific, economic, social, and practical knowledge, in order to design and also build structures, machines, devices, systems, materials and processes. *** An application of
software engineering Software engineering is a systematic engineering approach to software development. A software engineer is a person who applies the principles of software engineering to design, develop, maintain, test, and evaluate computer software. The term '' ...
– application of a systematic, disciplined, quantifiable approach to the design, development, operation, and maintenance of software, and the study of these approaches; that is, the application of engineering to software.
SWEBOK The ''Software Engineering Body of Knowledge'' (SWEBOK ( )) is an international standard ISO/IEC TR 19759:2005 specifying a guide to the generally accepted software engineering body of knowledge. The Guide to the Software Engineering Body of Know ...
**** A subfield of
computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as ana ...
– process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages (such as Java, C++, C#, Python, etc.). The purpose of programming is to create a set of instructions that computers use to perform specific operations or to exhibit desired behaviors. ***** A subfield of
artificial intelligence Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks in which this is done include speech re ...
programming – * A type of
system A system is a group of Interaction, interacting or interrelated elements that act according to a set of rules to form a unified whole. A system, surrounded and influenced by its environment (systems), environment, is described by its boundaries, ...
– set of interacting or interdependent components forming an integrated whole or a set of elements (often called 'components' ) and relationships which are different from relationships of the set or its elements to other elements or sets. ** A system that includes
software Software is a set of computer programs and associated documentation and data. This is in contrast to hardware, from which the system is built and which actually performs the work. At the lowest programming level, executable code consists ...
– software is a collection of computer programs and related data that provides the instructions for telling a computer what to do and how to do it. Software refers to one or more computer programs and data held in the storage of the computer. In other words, software is a set of programs, procedures, algorithms and its documentation concerned with the operation of a data processing system. * A type of
technology Technology is the application of knowledge to reach practical goals in a specifiable and reproducible way. The word ''technology'' may also mean the product of such an endeavor. The use of technology is widely prevalent in medicine, science, ...
– making, modification, usage, and knowledge of tools, machines, techniques, crafts, systems, methods of organization, in order to solve a problem, improve a preexisting solution to a problem, achieve a goal, handle an applied input/output relation or perform a specific function. It can also refer to the collection of such tools, machinery, modifications, arrangements and procedures. Technologies significantly affect human as well as other animal species' ability to control and adapt to their natural environments. ** A form of
computer technology Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
– computers and their application. NLP makes use of computers, image scanners, microphones, and many types of software programs. ***
Language technology Language technology, often called human language technology (HLT), studies methods of how computer programs or electronic devices can analyze, produce, modify or respond to human texts and speech. Working with language technology often requires broa ...
– consists of natural-language processing (NLP) and computational linguistics (CL) on the one hand, and speech technology on the other. It also includes many application oriented aspects of these. It is often called human language technology (HLT).


Prerequisite technologies

The following technologies make natural-language processing possible: *
Communication Communication (from la, communicare, meaning "to share" or "to be in relation with") is usually defined as the transmission of information. The term may also refer to the message communicated through such transmissions or the field of inquir ...
– the activity of a source sending a message to a receiver **
Language Language is a structured system of communication. The structure of a language is its grammar and the free components are its vocabulary. Languages are the primary means by which humans communicate, and may be conveyed through a variety of met ...
– ***
Speech Speech is a human vocal communication using language. Each language uses Phonetics, phonetic combinations of vowel and consonant sounds that form the sound of its words (that is, all English words sound different from all French words, even if ...
– ***
Writing Writing is a medium of human communication which involves the representation of a language through a system of physically Epigraphy, inscribed, Printing press, mechanically transferred, or Word processor, digitally represented Symbols (semiot ...
– **
Computing Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
– ***
Computer A computer is a machine that can be programmed to Execution (computing), carry out sequences of arithmetic or logical operations (computation) automatically. Modern digital electronic computers can perform generic sets of operations known as C ...
s – ***
Computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as ana ...
– ****
Information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
– ****
User interface In the industrial design field of human–computer interaction, a user interface (UI) is the space where interactions between humans and machines occur. The goal of this interaction is to allow effective operation and control of the machine f ...
– ***
Software Software is a set of computer programs and associated documentation and data. This is in contrast to hardware, from which the system is built and which actually performs the work. At the lowest programming level, executable code consists ...
– **** Text editing – program used to edit plain
text file A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating ...
s ****
Word processing A word is a basic element of language that carries an objective or practical meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of what a word is, there is no consen ...
– piece of software used for composing, editing, formatting, printing documents ***
Input device In computing, an input device is a piece of equipment used to provide data and control signals to an information processing system, such as a computer or information appliance. Examples of input devices include keyboards, mouse, scanners, cameras ...
s – pieces of hardware for sending data to a computer to be processed ****
Computer keyboard A computer keyboard is a peripheral input device modeled after the typewriter keyboard which uses an arrangement of buttons or keys to act as mechanical levers or electronic switches. Replacing early punched cards and paper tape technology ...
– typewriter style input device whose input is converted into various data depending on the circumstances ****
Image scanner An image scanner—often abbreviated to just scanner—is a device that optically scans images, printed text, handwriting or an object and converts it to a digital image. Commonly used in offices are variations of the desktop ''flatbed scanner'' ...
s –


Subfields of natural-language processing

*
Information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
(IE) – field concerned in general with the extraction of semantic information from text. This covers tasks such as
named-entity recognition Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre ...
,
coreference resolution In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions refer to the same person or thing; they have the same referent. For example, in ''Bill said Alice would arrive soon, and she did'', the words ''Alice'' ...
,
relationship extraction A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE add ...
, etc. *
Ontology engineering In computer science, information science and systems engineering, ontology engineering is a field which studies the methods and methodologies for building ontologies, which encompasses a representation, formal naming and definition of the categori ...
– field that studies the methods and methodologies for building ontologies, which are formal representations of a set of concepts within a domain and the relationships between those concepts. *
Speech processing Speech processing is the study of speech signals and the processing methods of signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied t ...
– field that covers
speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the m ...
,
text-to-speech Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal languag ...
and related tasks. *
Statistical natural-language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
– **
Statistical semantics In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of informat ...
– a subfield of
computational semantics Computational semantics is the study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions. It consequently plays an important role in natural-language processing and computatio ...
that establishes semantic relations between words to examine their contexts. ***
Distributional semantics Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. T ...
– a subfield of
statistical semantics In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of informat ...
that examines the semantic relationship of words across a corpora or in large samples of data.


Related fields

Natural-language processing contributes to, and makes use of (the theories, tools, and methodologies from), the following fields: *
Automated reasoning In computer science, in particular in knowledge representation and reasoning and metalogic, the area of automated reasoning is dedicated to understanding different aspects of reasoning. The study of automated reasoning helps produce computer progra ...
– area of computer science and mathematical logic dedicated to understanding various aspects of reasoning, and producing software which allows computers to reason completely, or nearly completely, automatically. A sub-field of artificial intelligence, automatic reasoning is also grounded in theoretical computer science and philosophy of mind. *
Linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguis ...
– scientific study of human language. Natural-language processing requires understanding of the structure and application of language, and therefore it draws heavily from linguistics. **
Applied linguistics Applied linguistics is an interdisciplinary field which identifies, investigates, and offers solutions to language-related real-life problems. Some of the academic fields related to applied linguistics are education, psychology, communication rese ...
– interdisciplinary field of study that identifies, investigates, and offers solutions to language-related real-life problems. Some of the academic fields related to applied linguistics are education, linguistics, psychology, computer science, anthropology, and sociology. Some of the subfields of applied linguistics relevant to natural-language processing are: *** Bilingualism / Multilingualism – ***
Computer-mediated communication Computer-mediated communication (CMC) is defined as any human communication that occurs through the use of two or more electronic devices. While the term has traditionally referred to those communications that occur via computer-mediated formats ...
(CMC) – any communicative transaction that occurs through the use of two or more networked computers. Research on CMC focuses largely on the social effects of different computer-supported communication technologies. Many recent studies involve Internet-based
social networking A social network is a social structure made up of a set of social actors (such as individuals or organizations), sets of dyadic ties, and other social interactions between actors. The social network perspective provides a set of methods for an ...
supported by
social software Social software, also known as social apps or social platform, include communications and interactive tools that are often based on the Internet. Communication tools typically handle the capturing, storing and presentation of communication, usua ...
. ***
Contrastive linguistics Contrastive linguistics is a practice-oriented linguistic approach that seeks to describe the differences and similarities between a pair of languages (hence it is occasionally called "''differential'' linguistics"). History While traditional ...
– practice-oriented linguistic approach that seeks to describe the differences and similarities between a pair of languages. ***
Conversation analysis Conversation analysis (CA) is an approach to the study of social interaction, embracing both verbal and non-verbal conduct, in situations of everyday life. CA originated as a sociological method, but has since spread to other fields. CA began with ...
(CA) – approach to the study of social interaction, embracing both verbal and non-verbal conduct, in situations of everyday life.
Turn-taking Turn-taking is a type of organization in conversation and discourse where participants speak one at a time in alternating turns. In practice, it involves processes for constructing contributions, responding to previous comments, and transitioning ...
is one aspect of language use that is studied by CA. ***
Discourse analysis Discourse analysis (DA), or discourse studies, is an approach to the analysis of written, vocal, or sign language use, or any significant semiotic event. The objects of discourse Analysis ( discourse, writing, conversation, communicative event ...
– various approaches to analyzing written, vocal, or sign language use or any significant semiotic event. ***
Forensic linguistics Forensic linguistics, legal linguistics, or language and the law, is the application of linguistic knowledge, methods, and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. It is a branch of ap ...
– application of linguistic knowledge, methods and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. ***
Interlinguistics Interlinguistics, as the science of planned languages, has existed for more than a century as a specific branch of linguistics for the study of various aspects of linguistic communication. Interlinguistics is a discipline formalized by Otto Jespers ...
– study of improving communications between people of different first languages with the use of ethnic and auxiliary languages (lingua franca). For instance by use of intentional international auxiliary languages, such as Esperanto or Interlingua, or spontaneous interlanguages known as pidgin languages. ***
Language assessment Language assessment or language testing is a field of study under the umbrella of applied linguistics. Its main focus is the assessment of first, second or other language in the school, college, or university context; assessment of language use ...
– assessment of first, second or other language in the school, college, or university context; assessment of language use in the workplace; and assessment of language in the immigration, citizenship, and asylum contexts. The assessment may include analyses of listening, speaking, reading, writing or cultural understanding, with respect to understanding how the language works theoretically and the ability to use the language practically. ***
Language pedagogy Language pedagogy is the discipline concerned with the theories and techniques of teaching language. It has been described as a type of teaching wherein the teacher draws from his prior knowledge and actual experience in teaching language. The appr ...
– science and art of language education, including approaches and methods of language teaching and study. Natural-language processing is used in programs designed to teach language, including first- and second-language training. ***
Language planning In sociolinguistics, language planning (also known as language engineering) is a deliberate effort to influence the function, structure or acquisition of languages or language varieties within a speech community.Kaplan B., Robert, and Richard ...
– ***
Language policy Language policy is an interdisciplinary academic field. Some scholars such as Joshua Fishman and Ofelia García consider it as part of sociolinguistics. On the other hand, other scholars such as Bernard SpolskyRobert B. Kaplanand Joseph Lo Bianco ...
– ***
Lexicography Lexicography is the study of lexicons, and is divided into two separate academic disciplines. It is the art of compiling dictionaries. * Practical lexicography is the art or craft of compiling, writing and editing dictionaries. * Theoretica ...
– *** Literacies – ***
Pragmatics In linguistics and related fields, pragmatics is the study of how context contributes to meaning. The field of study evaluates how human language is utilized in social interactions, as well as the relationship between the interpreter and the int ...
– ***
Second-language acquisition Second-language acquisition (SLA), sometimes called second-language learning — otherwise referred to as L2 (language 2) acquisition, is the process by which people learn a second language. Second-language acquisition is also the scientific dis ...
– *** Stylistics – ***
Translation Translation is the communication of the Meaning (linguistic), meaning of a #Source and target languages, source-language text by means of an Dynamic and formal equivalence, equivalent #Source and target languages, target-language text. The ...
– **
Computational linguistics Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...
– interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective. The models and tools of computational linguistics are used extensively in the field of natural-language processing, and vice versa. ***
Computational semantics Computational semantics is the study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions. It consequently plays an important role in natural-language processing and computatio ...
– ***
Corpus linguistics Corpus linguistics is the study of language, study of a language as that language is expressed in its text corpus (plural ''corpora''), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feas ...
– study of language as expressed in samples ''(corpora)'' of "real world" text. ''Corpora'' is the plural of ''corpus'', and a corpus is a specifically selected collection of texts (or speech segments) composed of natural language. After it is constructed (gathered or composed), a corpus is analyzed with the methods of computational linguistics to infer the meaning and context of its components (words, phrases, and sentences), and the relationships between them. Optionally, a corpus can be annotated ("tagged") with data (manually or automatically) to make the corpus easier to understand (e.g.,
part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definitio ...
). This data is then applied to make sense of user input, for example, to make better (automated) guesses of what people are talking about or saying, perhaps to achieve more narrowly focused web searches, or for speech recognition. **
Metalinguistics Metalinguistics is the branch of linguistics that studies language and its relationship to other cultural behaviors. It is the study of dialogue relationships between units of speech communication as manifestations and enactments of co-existence. ...
– ** Sign linguistics – scientific study and analysis of natural sign languages, their features, their structure (phonology, morphology, syntax, and semantics), their acquisition (as a primary or secondary language), how they develop independently of other languages, their application in communication, their relationships to other languages (including spoken languages), and many other aspects. *
Human–computer interaction Human–computer interaction (HCI) is research in the design and the use of computer technology, which focuses on the interfaces between people (users) and computers. HCI researchers observe the ways humans interact with computers and design tec ...
– the intersection of computer science and behavioral sciences, this field involves the study, planning, and design of the interaction between people (users) and computers. Attention to human-machine interaction is important, because poorly designed human-machine interfaces can lead to many unexpected problems. A classic example of this is the
Three Mile Island accident The Three Mile Island accident was a partial meltdown of the Three Mile Island, Unit 2 (TMI-2) reactor in Pennsylvania, United States. It began at 4 a.m. on March 28, 1979. It is the most significant accident in U.S. commercial nuclea ...
where investigations concluded that the design of the human–machine interface was at least partially responsible for the disaster. *
Information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...
(IR) – field concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP. *
Knowledge representation Knowledge representation and reasoning (KRR, KR&R, KR²) is the field of artificial intelligence (AI) dedicated to representing information about the world in a form that a computer system can use to solve complex tasks such as diagnosing a medic ...
(KR) – area of artificial intelligence research aimed at representing knowledge in symbols to facilitate inferencing from those knowledge elements, creating new elements of knowledge. Knowledge Representation research involves analysis of how to reason accurately and effectively and how best to use a set of symbols to represent a set of facts within a knowledge domain. **
Semantic network A semantic network, or frame network is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, ...
– study of semantic relations between concepts. *** Semantic Web – *
Machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
– subfield of computer science that examines pattern recognition and computational learning theory in artificial intelligence. There are three broad approaches to machine learning.
Supervised learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...
occurs when the machine is given example inputs and outputs by a teacher so that it can learn a rule that maps inputs to outputs.
Unsupervised learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...
occurs when the machine determines the inputs structure without being provided example inputs or outputs.
Reinforcement learning Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine ...
occurs when a machine must perform a goal without teacher feedback. **
Pattern recognition Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphi ...
– branch of
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
that examines how machines recognize regularities in data. As with machine learning, teachers can train machines to recognize patterns by providing them with example inputs and outputs (i.e.
Supervised Learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...
), or the machines can recognize patterns without being trained on any example inputs or outputs (i.e.
Unsupervised Learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...
). **
Statistical classification In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation (or observations) belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagno ...


Structures used in natural-language processing

* Anaphora – type of expression whose reference depends upon another referential element. E.g., in the sentence 'Sally preferred the company of herself', 'herself' is an anaphoric expression in that it is coreferential with 'Sally', the sentence's subject. *
Context-free language In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
– *
Controlled natural language Controlled natural languages (CNLs) are subsets of natural languages that are obtained by restricting the grammar and vocabulary in order to reduce or eliminate ambiguity and complexity. Traditionally, controlled languages fall into two major typ ...
– a natural language with a restriction introduced on its grammar and vocabulary in order to eliminate ambiguity and complexity * Corpus – body of data, optionally tagged (for example, through
part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definitio ...
), providing real world samples for analysis and comparison. **
Text corpus In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical a ...
– large and structured set of texts, nowadays usually electronically stored and processed. They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific subject (or ''domain''). **
Speech corpus A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or spea ...
– database of speech audio files and text transcriptions. In Speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition engine). In Linguistics, spoken corpora are used to do research into phonetic, conversation analysis, dialectology and other fields. *
Grammar In linguistics, the grammar of a natural language is its set of structure, structural constraints on speakers' or writers' composition of clause (linguistics), clauses, phrases, and words. The term can also refer to the study of such constraint ...
– **
Context-free grammar In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form :A\ \to\ \alpha with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
(CFG) – ** Constraint grammar (CG) – **
Definite clause grammar A definite clause grammar (DCG) is a way of expressing grammar, either for natural or formal languages, in a logic programming language such as Prolog. It is closely related to the concept of attribute grammars / affix grammars from which Prolog ...
(DCG) – ** Functional unification grammar (FUG) – **
Generalized phrase structure grammar Generalized phrase structure grammar (GPSG) is a framework for describing the syntax and semantics of natural languages. It is a type of constraint-based phrase structure grammar. Constraint based grammars are based around defining certain syntacti ...
(GPSG) – ** Head-driven phrase structure grammar (HPSG) – ** Lexical functional grammar (LFG) – **
Probabilistic context-free grammar Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA struct ...
(PCFG) – another name for stochastic context-free grammar. ** Stochastic context-free grammar (SCFG) – **
Systemic functional grammar Systemic functional grammar (SFG) is a form of grammatical description originated by Michael Halliday. It is part of a social semiotic approach to language called '' systemic functional linguistics''. In these two terms, ''systemic'' refers to ...
(SFG) – **
Tree-adjoining grammar Tree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi. Tree-adjoining grammars are somewhat similar to context-free grammars, but the elementary unit of rewriting is the tree rather than the symbol. Whereas context-free gra ...
(TAG) – *
Natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...
– * ''n''-gram – sequence of ''n'' number of tokens, where a "token" is a character, syllable, or word. The ''n'' is replaced by a number. Therefore, a 5-gram is an ''n''-gram of 5 letters, syllables, or words. "Eat this" is a 2-gram (also known as a bigram). **
Bigram A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an ''n''-gram for ''n''=2. The frequency distribution of every bigram in a string is commonly used f ...
– ''n''-gram of 2 tokens. Every sequence of 2 adjacent elements in a string of tokens is a bigram. Bigrams are used for speech recognition, they can be used to solve cryptograms, and bigram frequency is one approach to statistical language identification. **
Trigram Trigrams are a special case of the ''n''-gram, where ''n'' is 3. They are often used in natural language processing for performing statistical analysis of texts and in cryptography for control and use of ciphers and codes. Frequency Contex ...
– special case of the ''n''-gram, where ''n'' is 3. *
Ontology In metaphysics, ontology is the philosophical study of being, as well as related concepts such as existence, becoming, and reality. Ontology addresses questions like how entities are grouped into categories and which of these entities exis ...
– formal representation of a set of concepts within a domain and the relationships between those concepts. **
Taxonomy Taxonomy is the practice and science of categorization or classification. A taxonomy (or taxonomical classification) is a scheme of classification, especially a hierarchical classification, in which things are organized into groups or types. ...
– practice and science of classification, including the principles underlying classification, and the methods of classifying things or concepts. ***
Hyponymy and hypernymy In linguistics, semantics, general semantics, and ontologies, hyponymy () is a semantic relation between a hyponym denoting a subtype and a hypernym or hyperonym (sometimes called umbrella term or blanket term) denoting a supertype. In other wor ...
– the linguistics of hyponyms and hypernyms. A hyponym shares a type-of relationship with its hypernym. For example, pigeon, crow, eagle and seagull are all hyponyms of bird (their hypernym); which, in turn, is a hyponym of animal. ***
Taxonomy for search engines Taxonomy for search engines refers to classification methods that improve relevance in vertical search. Taxonomies of entities are tree structures whose nodes are labelled with entities likely to occur in a web search query. Searches use these tree ...
– typically called a "taxonomy of entities". It is a
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
in which nodes are labelled with entities which are expected to occur in a web search query. These trees are used to match keywords from a search query with the keywords from relevant answers (or snippets). *
Textual entailment Textual entailment (TE) in natural language processing is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are ...
– directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively. The relation is directional because even if "t entails h", the reverse "h entails t" is much less certain. *
Triphone In linguistics, a triphone is a sequence of three consecutive phonemes. Triphones are useful in models of natural language processing where they are used to establish the various contexts in which a phoneme can occur in a particular natural languag ...
– sequence of three phonemes. Triphones are useful in models of natural-language processing where they are used to establish the various contexts in which a phoneme can occur in a particular natural language.


Processes of NLP


Applications

*
Automated essay scoring Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to c ...
(AES) – the use of specialized computer programs to assign grades to essays written in an educational setting. It is a method of educational assessment and an application of natural-language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades—for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification. *
Automatic image annotation Automatic image annotation (also known as automatic image tagging or linguistic indexing) is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of comput ...
– process by which a computer system automatically assigns textual metadata in the form of captioning or keywords to a digital image. The annotations are used in image retrieval systems to organize and locate images of interest from a database. *
Automatic summarization Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commo ...
– process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper. ** Types *** Keyphrase extraction – *** Document summarization – ****
Multi-document summarization Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickl ...
– ** Methods and techniques *** Extraction-based summarization – *** Abstraction-based summarization – *** Maximum entropy-based summarization – *** Sentence extraction – *** Aided summarization – **** Human aided machine summarization (HAMS) – **** Machine aided human summarization (MAHS) – * Automatic taxonomy induction – automated construction of
tree structure A tree structure, tree diagram, or tree model is a way of representing the hierarchical nature of a structure in a graphical form. It is named a "tree structure" because the classic representation resembles a tree, although the chart is gener ...
s from a corpus. This may be applied to building taxonomical classification systems for reading by end users, such as web directories or subject outlines. *
Coreference resolution In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions refer to the same person or thing; they have the same referent. For example, in ''Bill said Alice would arrive soon, and she did'', the words ''Alice'' ...
– in order to derive the correct interpretation of text, or even to estimate the relative importance of various mentioned subjects, pronouns and other referring expressions need to be connected to the right individuals or objects. Given a sentence or larger chunk of text, coreference resolution determines which words ("mentions") refer to which objects ("entities") included in the text. **
Anaphora resolution In linguistics, anaphora () is the use of an expression whose interpretation depends upon another expression in context (its antecedent or postcedent). In a narrower sense, anaphora is the use of an expression that depends specifically upon an a ...
– concerned with matching up pronouns with the nouns or names that they refer to. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to). *
Dialog system A dialogue system, or conversational agent (CA), is a computer system intended to converse with a human. Dialogue systems employed one or more of text, speech, graphics, haptics, gestures, and other modes for communication on both the input and o ...
– * Foreign-language reading aid – computer program that assists a non-native language user to read properly in their target language. The proper reading means that the pronunciation should be correct and stress to different parts of the words should be proper. * Foreign-language writing aid – computer program or any other instrument that assists a non-native language user (also referred to as a foreign-language learner) in writing decently in their target language. Assistive operations can be classified into two categories: on-the-fly prompts and post-writing checks. * Grammar checking – the act of verifying the grammatical correctness of written text, especially if this act is performed by a
computer program A computer program is a sequence or set of instructions in a programming language for a computer to execute. Computer programs are one component of software, which also includes documentation and other intangible components. A computer program ...
. *
Information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...
– **
Cross-language information retrieval Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query. The term "cross-language information retrieval" has man ...
– *
Machine translation Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates t ...
(MT) – aims to automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "
AI-complete In the field of artificial intelligence, the most difficult problems are informally known as AI-complete or AI-hard, implying that the difficulty of these computational problems, assuming intelligence is computational, is equivalent to that of solv ...
", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly. ** Classical approach of machine translation – rules-based machine translation. **
Computer-assisted translation Computer-aided translation (CAT), also referred to as computer-assisted translation or computer-aided human translation (CAHT), is the use of software to assist a human translator in the translation process. The translation is created by a huma ...
– ***
Interactive machine translation Interactive machine translation (IMT), is a specific sub-field of computer-aided translation. Under this translation paradigm, the computer software that assists the human translator attempts to predict the text the user is going to input by taking ...
– ***
Translation memory A translation memory (TM) is a database that stores "segments", which can be sentences, paragraphs or sentence-like units (headings, titles or elements in a list) that have previously been translated, in order to aid human translators. The translat ...
– database that stores so-called "segments", which can be sentences, paragraphs or sentence-like units (headings, titles or elements in a list) that have previously been translated, in order to aid human translators. **
Example-based machine translation Example-based machine translation (EBMT) is a method of machine translation often characterized by its use of a bilingual corpus with parallel texts as its main knowledge base at run-time. It is essentially a translation by analogy and can be vi ...
– **
Rule-based machine translation Rule-based machine translation (RBMT; "Classical Approach" of MT) is machine translation systems based on linguistic information about source and target languages basically retrieved from (unilingual, bilingual or multilingual) dictionaries and gram ...
– *
Natural-language programming Natural-language programming (NLP) is an ontology-assisted way of programming in terms of natural-language sentences, e.g. English. A structured document with Content, sections and subsections for explanations of sentences forms a NLP documen ...
– interpreting and compiling instructions communicated in natural language into computer instructions (machine code). * Natural-language search – *
Optical character recognition Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scen ...
(OCR) – given an image representing printed text, determine the corresponding text. *
Question answering Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural l ...
– given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?"). **
Open domain question answering Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural ...
– *
Spam filtering Various anti-spam techniques are used to prevent email spam (unsolicited bulk email). No technique is a complete solution to the spam problem, and each has trade-offs between incorrectly rejecting legitimate email (false positives) as opposed to ...
– *
Sentiment analysis Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjec ...
– extracts subjective information usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion in the social media, for the purpose of marketing. *
Speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the m ...
– given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of
text to speech Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal languag ...
and is one of the extremely difficult problems colloquially termed "
AI-complete In the field of artificial intelligence, the most difficult problems are informally known as AI-complete or AI-hard, implying that the difficulty of these computational problems, assuming intelligence is computational, is equivalent to that of solv ...
" (see above). In
natural speech In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has linguistic evolution, evolved naturally in humans through use and repetition without conscious planning or premeditati ...
there are hardly any pauses between successive words, and thus
speech segmentation Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language proces ...
is a necessary subtask of speech recognition (see below). In most spoken languages, the sounds representing successive letters blend into each other in a process termed
coarticulation Coarticulation in its general sense refers to a situation in which a conceptually isolated speech sound is influenced by, and becomes more like, a preceding or following speech sound. There are two types of coarticulation: ''anticipatory coarticulat ...
, so the conversion of the analog signal to discrete characters can be a very difficult process. *
Speech synthesis Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal languag ...
(Text-to-speech) – *
Text-proofing Proofreading is the reading of a galley proof or an electronic copy of a publication to find and correct reproduction errors of text or art. Proofreading is the final step in the editorial cycle before publication. Professional Traditional m ...
– *
Text simplification Text simplification is an operation used in natural language processing to change, enhance, classify, or otherwise process an existing body of human-readable text so its grammar and structure is greatly simplified while the underlying meaning and ...
– automated editing a document to include fewer words, or use easier words, while retaining its underlying meaning and information.


Component processes

*
Natural-language understanding Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an A ...
– converts chunks of text into more formal representations such as
first-order logic First-order logic—also known as predicate logic, quantificational logic, and first-order predicate calculus—is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science. First-order logic uses quantifie ...
structures that are easier for
computer A computer is a machine that can be programmed to Execution (computing), carry out sequences of arithmetic or logical operations (computation) automatically. Modern digital electronic computers can perform generic sets of operations known as C ...
programs to manipulate. Natural-language understanding involves the identification of the intended semantic from the multiple possible semantics which can be derived from a natural-language expression which usually takes the form of organized notations of natural-languages concepts. Introduction and creation of language metamodel and ontology are efficient however empirical solutions. An explicit formalization of natural-languages semantics without confusions with implicit assumptions such as
closed-world assumption The closed-world assumption (CWA), in a formal system of logic used for knowledge representation, is the presumption that a statement that is true is also known to be true. Therefore, conversely, what is not currently known to be true, is false. T ...
(CWA) vs.
open-world assumption In a formal system of logic used for knowledge representation, the open-world assumption is the assumption that the truth value of a statement may be true irrespective of whether or not it is ''known'' to be true. It is the opposite of the close ...
, or subjective Yes/No vs. objective True/False is expected for the construction of a basis of semantics formalization. *
Natural-language generation Natural language generation (NLG) is a software process that produces natural language output. In one of the most widely-cited survey of NLG methods, NLG is characterized as "the subfield of artificial intelligence and computational linguistics th ...
– task of converting information from computer databases into readable human language.


Component processes of natural-language understanding

*
Automatic document classification Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectuall ...
(text categorization) – **
Automatic language identification In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solv ...
– *
Compound term processing Compound-term processing, in information-retrieval, is search result matching on the basis of compound terms. Compound terms are built by combining two or more simple terms; for example, "triple" is a single word term, but "triple heart bypass" is ...
– category of techniques that identify compound terms and match them to their definitions. Compound terms are built by combining two (or more) simple terms, for example "triple" is a single word term but "triple heart bypass" is a compound term. * Automatic taxonomy induction – * Corpus processing – **
Automatic acquisition of lexicon Automatic acquisition of lexicon is a computerized process used for the development of a complex morphological lexicon of a language. The lexicon is essential for the NLP (Natural language processing), as well as a prerequisite to any wide-coverage ...
– **
Text normalization Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consis ...
– **
Text simplification Text simplification is an operation used in natural language processing to change, enhance, classify, or otherwise process an existing body of human-readable text so its grammar and structure is greatly simplified while the underlying meaning and ...
– * Deep linguistic processing – *
Discourse analysis Discourse analysis (DA), or discourse studies, is an approach to the analysis of written, vocal, or sign language use, or any significant semiotic event. The objects of discourse Analysis ( discourse, writing, conversation, communicative event ...
– includes a number of related tasks. One task is identifying the
discourse Discourse is a generalization of the notion of a conversation to any form of communication. Discourse is a major topic in social theory, with work spanning fields such as sociology, anthropology, continental philosophy, and discourse analysis. ...
structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the
speech act In the philosophy of language and linguistics, speech act is something expressed by an individual that not only presents information but performs an action as well. For example, the phrase "I would like the kimchi; could you please pass it to me?" ...
s in a chunk of text (e.g. yes-no questions, content questions, statements, assertions, orders, suggestions, etc.). *
Information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
– **
Text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...
– process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. ***
Biomedical text mining Biomedical text mining (including biomedical natural language processing or BioNLP) refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, bi ...
– (also known as BioNLP), this is text mining applied to texts and literature of the biomedical and molecular biology domain. It is a rather recent research field drawing elements from natural-language processing, bioinformatics, medical informatics and computational linguistics. There is an increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature due to the increasing number of electronically available publications stored in databases such as PubMed. ***
Decision tree learning Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of ob ...
– *** Sentence extraction – **
Terminology extraction Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a give ...
– *
Latent semantic indexing Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...
– *
Lemmatisation Lemmatisation ( or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemma ...
– groups together all like terms that share a same lemma such that they are classified as a single item. * Morphological segmentation – separates words into individual
morphemes A morpheme is the smallest meaningful constituent of a linguistic expression. The field of linguistic study dedicated to morphemes is called morphology. In English, morphemes are often but not necessarily words. Morphemes that stand alone a ...
and identifies the class of the morphemes. The difficulty of this task depends greatly on the complexity of the
morphology Morphology, from the Greek and meaning "study of shape", may refer to: Disciplines * Morphology (archaeology), study of the shapes or forms of artifacts * Morphology (astronomy), study of the shape of astronomical objects such as nebulae, galaxies ...
(i.e. the structure of words) of the language being considered.
English English usually refers to: * English language * English people English may also refer to: Peoples, culture, and language * ''English'', an adjective for something of, from, or related to England ** English national ide ...
has fairly simple morphology, especially
inflectional morphology In linguistic morphology, inflection (or inflexion) is a process of word formation in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and de ...
, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms. *
Named-entity recognition Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre ...
(NER) – given a stream of text, determines which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although
capitalization Capitalization (American English) or capitalisation (British English) is writing a word with its first letter as a capital letter (uppercase letter) and the remaining letters in lower case, in writing systems with a case distinction. The term a ...
can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g.
Chinese Chinese can refer to: * Something related to China * Chinese people, people of Chinese nationality, citizenship, and/or ethnicity **''Zhonghua minzu'', the supra-ethnic concept of the Chinese nation ** List of ethnic groups in China, people of ...
or
Arabic Arabic (, ' ; , ' or ) is a Semitic languages, Semitic language spoken primarily across the Arab world.Semitic languages: an international handbook / edited by Stefan Weninger; in collaboration with Geoffrey Khan, Michael P. Streck, Janet C ...
) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example,
German German(s) may refer to: * Germany (of or related to) ** Germania (historical use) * Germans, citizens of Germany, people of German ancestry, or native speakers of the German language ** For citizens of Germany, see also German nationality law **Ge ...
capitalizes all
noun A noun () is a word that generally functions as the name of a specific object or set of objects, such as living creatures, places, actions, qualities, states of existence, or ideas.Example nouns for: * Living creatures (including people, alive, d ...
s, regardless of whether they refer to names, and French and
Spanish Spanish might refer to: * Items from or related to Spain: **Spaniards are a nation and ethnic group indigenous to Spain **Spanish language, spoken in Spain and many Latin American countries **Spanish cuisine Other places * Spanish, Ontario, Can ...
do not capitalize names that serve as
adjective In linguistics, an adjective (list of glossing abbreviations, abbreviated ) is a word that generally grammatical modifier, modifies a noun or noun phrase or describes its referent. Its semantic role is to change information given by the noun. Tra ...
s. *
Ontology learning Ontology learning (ontology extraction, ontology generation, or ontology acquisition) is the automatic or semi-automatic creation of ontology (information science), ontologies, including extracting the corresponding Domain of discourse, domain's te ...
– automatic or semi-automatic creation of
ontologies In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains ...
, including extracting the corresponding domain's terms and the relationships between those concepts from a corpus of natural-language text, and encoding them with an
ontology language In computer science and artificial intelligence, ontology languages are formal languages used to construct ontologies. They allow the encoding of knowledge about specific domains and often include reasoning rules that support the processing of th ...
for easy retrieval. Also called "ontology extraction", "ontology generation", and "ontology acquisition". *
Parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
– determines the
parse tree A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in co ...
(grammatical analysis) of a given sentence. The
grammar In linguistics, the grammar of a natural language is its set of structure, structural constraints on speakers' or writers' composition of clause (linguistics), clauses, phrases, and words. The term can also refer to the study of such constraint ...
for
natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...
s is
ambiguous Ambiguity is the type of meaning (linguistics), meaning in which a phrase, statement or resolution is not explicitly defined, making several interpretations wikt:plausible#Adjective, plausible. A common aspect of ambiguity is uncertainty. It ...
and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). **
Shallow parsing Shallow parsing (also chunking or light parsing) is an analysis of a sentence which first identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and then links them to higher order units that have discrete grammatical meanings ...
– *
Part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definitio ...
– given a sentence, determines the
part of speech In grammar, a part of speech or part-of-speech (abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are assi ...
for each word. Many words, especially common ones, can serve as multiple
parts of speech In grammar, a part of speech or part-of-speech (abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are ass ...
. For example, "book" can be a
noun A noun () is a word that generally functions as the name of a specific object or set of objects, such as living creatures, places, actions, qualities, states of existence, or ideas.Example nouns for: * Living creatures (including people, alive, d ...
("the book on the table") or
verb A verb () is a word (part of speech) that in syntax generally conveys an action (''bring'', ''read'', ''walk'', ''run'', ''learn''), an occurrence (''happen'', ''become''), or a state of being (''be'', ''exist'', ''stand''). In the usual descri ...
("to book a flight"); "set" can be a
noun A noun () is a word that generally functions as the name of a specific object or set of objects, such as living creatures, places, actions, qualities, states of existence, or ideas.Example nouns for: * Living creatures (including people, alive, d ...
,
verb A verb () is a word (part of speech) that in syntax generally conveys an action (''bring'', ''read'', ''walk'', ''run'', ''learn''), an occurrence (''happen'', ''become''), or a state of being (''be'', ''exist'', ''stand''). In the usual descri ...
or
adjective In linguistics, an adjective (list of glossing abbreviations, abbreviated ) is a word that generally grammatical modifier, modifies a noun or noun phrase or describes its referent. Its semantic role is to change information given by the noun. Tra ...
; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little
inflectional morphology In linguistic morphology, inflection (or inflexion) is a process of word formation in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and de ...
, such as
English English usually refers to: * English language * English people English may also refer to: Peoples, culture, and language * ''English'', an adjective for something of, from, or related to England ** English national ide ...
are particularly prone to such ambiguity.
Chinese Chinese can refer to: * Something related to China * Chinese people, people of Chinese nationality, citizenship, and/or ethnicity **''Zhonghua minzu'', the supra-ethnic concept of the Chinese nation ** List of ethnic groups in China, people of ...
is prone to such ambiguity because it is a
tonal language Tone is the use of pitch in language to distinguish lexical or grammatical meaning – that is, to distinguish or to inflect words. All verbal languages use pitch to express emotional and other paralinguistic information and to convey empha ...
during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey intended meaning. * Query expansion – *
Relationship extraction A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE add ...
– given a chunk of text, identifies the relationships among named entities (e.g. who is the wife of whom). * Semantic analysis (computational) – formal analysis of meaning, and "computational" refers to approaches that in principle support effective implementation. **
Explicit semantic analysis In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word i ...
– **
Latent semantic analysis Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...
– **
Semantic analytics Semantic analytics, also termed ''semantic relatedness'', is the use of ontologies to analyze content in web resources. This field of research combines text analytics and Semantic Web technologies like RDF. Semantic analytics measures the relate ...
– *
Sentence breaking Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing to ...
(also known as
sentence boundary disambiguation Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing too ...
and sentence detection) – given a chunk of text, finds the sentence boundaries. Sentence boundaries are often marked by
period Period may refer to: Common uses * Era, a length or span of time * Full stop (or period), a punctuation mark Arts, entertainment, and media * Period (music), a concept in musical composition * Periodic sentence (or rhetorical period), a concept ...
s or other
punctuation mark Punctuation (or sometimes interpunction) is the use of spacing, conventional signs (called punctuation marks), and certain typographical devices as aids to the understanding and correct reading of written text, whether read silently or aloud. A ...
s, but these same characters can serve other purposes (e.g. marking
abbreviation An abbreviation (from Latin ''brevis'', meaning ''short'') is a shortened form of a word or phrase, by any method. It may consist of a group of letters or words taken from the full version of the word or phrase; for example, the word ''abbrevia ...
s). *
Speech segmentation Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language proces ...
– given a sound clip of a person or people speaking, separates it into words. A subtask of
speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the m ...
and typically grouped with it. *
Stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morpholog ...
– reduces an inflected or derived word into its
word stem In linguistics, a word stem is a part of a word responsible for its lexical meaning. The term is used with slightly different meanings depending on the morphology of the language in question. In Athabaskan linguistics, for example, a verb stem is ...
, base, or
root In vascular plants, the roots are the organs of a plant that are modified to provide anchorage for the plant and take in water and nutrients into the plant body, which allows plants to grow taller and faster. They are most often below the sur ...
form. * Text chunking – *
Tokenization Tokenization may refer to: * Tokenization (lexical analysis) in language processing * Tokenization (data security) in the field of data security * Word segmentation * Tokenism Tokenism is the practice of making only a perfunctory or symbolic ...
– given a chunk of text, separates it into distinct words, symbols, sentences, or other units * Topic segmentation and recognition – given a chunk of text, separates it into segments each of which is devoted to a topic, and identifies the topic of the segment. *
Truecasing Truecasing, also called capitalization recovery, capitalization correction, or case restoration, is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This com ...
– *
Word segmentation Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in compu ...
– separates a chunk of continuous text into separate words. For a language like
English English usually refers to: * English language * English people English may also refer to: Peoples, culture, and language * ''English'', an adjective for something of, from, or related to England ** English national ide ...
, this is fairly trivial, since words are usually separated by spaces. However, some written languages like
Chinese Chinese can refer to: * Something related to China * Chinese people, people of Chinese nationality, citizenship, and/or ethnicity **''Zhonghua minzu'', the supra-ethnic concept of the Chinese nation ** List of ethnic groups in China, people of ...
,
Japanese Japanese may refer to: * Something from or related to Japan, an island country in East Asia * Japanese language, spoken mainly in Japan * Japanese people, the ethnic group that identifies with Japan through ancestry or culture ** Japanese diaspor ...
and
Thai Thai or THAI may refer to: * Of or from Thailand, a country in Southeast Asia ** Thai people, the dominant ethnic group of Thailand ** Thai language, a Tai-Kadai language spoken mainly in and around Thailand *** Thai script *** Thai (Unicode block ...
do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the
vocabulary A vocabulary is a set of familiar words within a person's language. A vocabulary, usually developed with age, serves as a useful and fundamental tool for communication and acquiring knowledge. Acquiring an extensive vocabulary is one of the la ...
and
morphology Morphology, from the Greek and meaning "study of shape", may refer to: Disciplines * Morphology (archaeology), study of the shapes or forms of artifacts * Morphology (astronomy), study of the shape of astronomical objects such as nebulae, galaxies ...
of words in the language. *
Word-sense disambiguation Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to consc ...
(WSD) – because many words have more than one meaning, word-sense disambiguation is used to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as
WordNet WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into '' synsets'' with short definition ...
. **
Word-sense induction In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses of a word (i.e. meanings). Given that the output of word-sens ...
– open problem of natural-language processing, which concerns the automatic identification of the senses of a word (i.e. meanings). Given that the output of word-sense induction is a set of senses for the target word (sense inventory), this task is strictly related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to solve the ambiguity of words in context. **
Automatic acquisition of sense-tagged corpora The knowledge acquisition bottleneck is perhaps the major impediment to solving the word sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexica ...
– *
W-shingling In natural language processing a ''w-shingling'' is a set of ''unique'' ''shingles'' (therefore ''n-grams'') each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity be ...
– set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.


Component processes of natural-language generation

Natural-language generation Natural language generation (NLG) is a software process that produces natural language output. In one of the most widely-cited survey of NLG methods, NLG is characterized as "the subfield of artificial intelligence and computational linguistics th ...
– task of converting information from computer databases into readable human language. * Automatic taxonomy induction (ATI) – automated building of
tree structure A tree structure, tree diagram, or tree model is a way of representing the hierarchical nature of a structure in a graphical form. It is named a "tree structure" because the classic representation resembles a tree, although the chart is gener ...
s from a corpus. While ATI is used to construct the core of ontologies (and doing so makes it a component process of natural-language understanding), when the ontologies being constructed are end user readable (such as a subject outline), and these are used for the construction of further documentation (such as using an outline as the basis to construct a report or treatise) this also becomes a component process of natural-language generation. *
Document structuring Document Structuring is a subtask of Natural language generation, which involves deciding the order and grouping (for example into paragraphs) of sentences in a generated text. It is closely related to the Content determination NLG task. Example ...


History of natural-language processing

History of natural-language processing *
History of machine translation Machine translation is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. In the 1950s, machine translation became a reality in research, although ref ...
* History of automated essay scoring * History of natural-language user interface * History of natural-language understanding * History of optical character recognition * History of question answering * History of speech synthesis *
Turing test The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to artificial intelligence, exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing propos ...
– test of a machine's ability to exhibit intelligent behavior, equivalent to or indistinguishable from, that of an actual human. In the original illustrative example, a human judge engages in a natural-language conversation with a human and a machine designed to generate performance indistinguishable from that of a human being. All participants are separated from one another. If the judge cannot reliably tell the machine from the human, the machine is said to have passed the test. The test was introduced by Alan Turing in his 1950 paper "Computing Machinery and Intelligence," which opens with the words: "I propose to consider the question, 'Can machines think?'" *
Universal grammar Universal grammar (UG), in modern linguistics, is the theory of the genetic component of the language faculty, usually credited to Noam Chomsky. The basic postulate of UG is that there are innate constraints on what the grammar of a possible hum ...
– theory in
linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguis ...
, usually credited to
Noam Chomsky Avram Noam Chomsky (born December 7, 1928) is an American public intellectual: a linguist, philosopher, cognitive scientist, historian, social critic, and political activist. Sometimes called "the father of modern linguistics", Chomsky is ...
, proposing that the ability to learn grammar is hard-wired into the brain. The theory suggests that linguistic ability manifests itself without being taught (''see''
poverty of the stimulus Poverty of the stimulus (POS) is the controversial argument from linguistics that children are not exposed to rich enough data within their linguistic environments to acquire every feature of their language. This is considered evidence contrary to ...
), and that there are properties that all natural
human languages Language is a structured system of communication. The structure of a language is its grammar and the free components are its vocabulary. Languages are the primary means by which humans communicate, and may be conveyed through a variety of met ...
share. It is a matter of observation and experimentation to determine precisely what abilities are innate and what properties are shared by all languages. * ALPAC – was a committee of seven scientists led by John R. Pierce, established in 1964 by the U. S. Government in order to evaluate the progress in computational linguistics in general and machine translation in particular. Its report, issued in 1966, gained notoriety for being very skeptical of research done in machine translation so far, and emphasizing the need for basic research in computational linguistics; this eventually caused the U. S. Government to reduce its funding of the topic dramatically. *
Conceptual dependency theory Conceptual dependency theory is a model of natural language understanding used in artificial intelligence systems. Roger Schank at Stanford University introduced the model in 1969, in the early days of artificial intelligence. This model was exten ...
– a model of natural-language understanding used in artificial intelligence systems.
Roger Schank Roger Carl Schank (born 1946) is an American artificial intelligence theorist, cognitive psychologist, learning scientist, educational reformer, and entrepreneur. Beginning in the late 1960s, he pioneered conceptual dependency theory (within th ...
at Stanford University introduced the model in 1969, in the early days of artificial intelligence. This model was extensively used by Schank's students at Yale University such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner. *
Augmented transition network An augmented transition network or ATN is a type of graph theoretic structure used in the operational definition of formal languages, used especially in parsing relatively complex natural languages, and having wide application in artificial intell ...
– type of graph theoretic structure used in the operational definition of formal languages, used especially in parsing relatively complex natural languages, and having wide application in artificial intelligence. Introduced by William A. Woods in 1970. *
Distributed Language Translation Distributed Language Translation ( eo, Distribuita Lingvo-Tradukado, DLT) was a project to develop an interlingual machine translation system for twelve European languages. It ran between 1985 and 1990. :The distinctive feature of DLT was the use ...
(project) –


Timeline of NLP software


General natural-language processing concepts

* Sukhotin's algorithm – statistical classification algorithm for classifying characters in a text as vowels or consonants. It was initially created by Boris V. Sukhotin. *
T9 (predictive text) T9 is a predictive text technology for mobile phones (specifically those that contain a Telephone keypad, 3×4 numeric keypad), originally developed by Tegic, Tegic Communications, now part of Nuance Communications. T9 stands for ''Text on 9 keys ...
– stands for "Text on 9 keys", is a USA-patented predictive text technology for mobile phones (specifically those that contain a 3x4 numeric keypad), originally developed by Tegic Communications, now part of Nuance Communications. *
Tatoeba Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. Its name comes from the Japanese phrase "tatoeba" (), meaning "for example". It is written and maintained by a community of volunteer ...
– free collaborative online database of example sentences geared towards foreign-language learners. * Teragram Corporation – fully owned subsidiary of SAS Institute, a major producer of statistical analysis software, headquartered in Cary, North Carolina, USA. Teragram is based in Cambridge, Massachusetts and specializes in the application of computational linguistics to multilingual natural-language processing. * TipTop Technologies – company that developed TipTop Search, a real-time web, social search engine with a unique platform for semantic analysis of natural language. TipTop Search provides results capturing individual and group sentiment, opinions, and experiences from content of various sorts including real-time messages from Twitter or consumer product reviews on Amazon.com. *
Transderivational search Transderivational search (often abbreviated to TDS) is a psychology, psychological and cybernetics term, meaning when a search is being conducted for a fuzzy logic, fuzzy match across a broad field. In computing the equivalent function can be perf ...
– when a search is being conducted for a fuzzy match across a broad field. In computing the equivalent function can be performed using content-addressable memory. * Vocabulary mismatch – common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently. *
LRE Map The LRE Map (Language Resources and Evaluation) is a freely accessible large database on resources dedicated to Natural language processing. The original feature of LRE Map is that the records are collected during the submission of different major N ...
– *
Reification (linguistics) Reification in natural language processing refers to where a natural language statement is transformed so actions and events in it become quantifiable variables. For example "John chased the duck furiously" can be transformed into something like ...
– * Semantic Web – **
Metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
– *
Spoken dialogue system A spoken dialog system (SDS) is a computer system able to converse with a human with voice. It has two essential components that do not exist in a written text dialog system: a speech recognizer and a text-to-speech module (written text dialog syst ...
– * Affix grammar over a finite lattice – * Aggregation (linguistics) – *
Bag-of-words model The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding g ...
– model that represents a text as a bag (multiset) of its words that disregards grammar and word sequence, but maintains multiplicity. This model is a commonly used to train document classifiers * Brill tagger – * Cache language model – *
ChaSen ChaSen is a morphological parser for the Japanese language. This tool for analyzing morphemes was developed at the Matsumoto laboratory, Nara Institute of Science and Technology. See also * MeCab MeCab is an open-source text segmentation lib ...
,
MeCab MeCab is an open-source text segmentation library for use with text written in the Japanese language originally developed by the Nara Institute of Science and Technology and currently maintained by Taku Kudou (工藤拓) as part of his work on th ...
– provide morphological analysis and word splitting for
Japanese Japanese may refer to: * Something from or related to Japan, an island country in East Asia * Japanese language, spoken mainly in Japan * Japanese people, the ethnic group that identifies with Japan through ancestry or culture ** Japanese diaspor ...
*
Classic monolingual WSD Classic monolingual Word Sense Disambiguation evaluation tasks uses WordNet as its sense inventory and is largely based on supervised / semi-supervised classification with the manually sense annotated corpora: *Classic English WSD uses the Prin ...
– *
ClearForest ClearForest was an Israeli software company that developed and marketed text analytics and text mining solutions. History Founded in 1998, ClearForest had its headquarters just outside Boston and a development center in Or Yehuda. The company w ...
– *
CMU Pronouncing Dictionary The CMU Pronouncing Dictionary (also known as CMUdict) is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research. CMUdict provides a mapping orthog ...
– also known as ''cmudict'', is a public domain pronouncing dictionary designed for uses in speech technology, and was created by
Carnegie Mellon University Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. One of its predecessors was established in 1900 by Andrew Carnegie as the Carnegie Technical Schools; it became the Carnegie Institute of Technology ...
(CMU). It defines a mapping from English words to their North American pronunciations, and is commonly used in speech processing applications such as the
Festival Speech Synthesis System The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black, Paul Taylor and Richard Caley at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Subst ...
and the
CMU Sphinx CMU Sphinx, also called Sphinx for short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers (Sphinx 2 - 4) and an acoustic model traine ...
speech recognition system. * Concept mining – *
Content determination Content determination is the subtask of natural language generation (NLG) that involves deciding on the information to be communicated in a generated text. It is closely related to the task of document structuring. Example Consider an NLG system w ...
– *
DATR DATR is a language for lexicon, lexical knowledge representation. The lexical knowledge is encoded in a network of nodes. Each node has a set of attributes encoded with it. A node can represent a word or a word form. DATR was developed in the lat ...
– *
DBpedia Spotlight DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web. DBpedia allows users to semantical ...
– * Deep linguistic processing – *
Discourse relation A discourse relation (also coherence relation or rhetorical relation) is a description of how two segments of discourse are logically and/or structurally connected to one another. A widely upheld position is that in coherent discourse, every ind ...
– *
Document-term matrix A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix i ...
– * Dragomir R. Radev – *
ETBLAST eTBLAST was a free text-similarity service now defunct. It was initially developed by Alexander Pertsemlidis and Harold “Skip” Garner in 2005 at The University of Texas Southwestern Medical Center. It offered access to the following databas ...
– * Filtered-popping recursive transition network – * Robby Garner – *
GeneRIF A GeneRIF or Gene Reference Into Function is a short (255 characters or fewer) statement about the function of a gene. GeneRIFs provide a simple mechanism for allowing scientists to add to the functional annotation of genes described in the Entrez ...
– * Gorn address – *
Grammar induction Grammar induction (or grammatical inference) is the process in machine learning of learning a formal grammar (usually as a collection of ''re-write rules'' or '' productions'' or alternatively as a finite state machine or automaton of some kind) fr ...
– *
Grammatik ''Grammatik'' was the first grammar checking program developed for home computer systems. Aspen Software of Albuquerque, NM, released the earliest version of this diction and style checker for personal computers. It was first released no later ...
– * Hashing-Trick – *
Hidden Markov model A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...
– *
Human language technology Language technology, often called human language technology (HLT), studies methods of how computer programs or electronic devices can analyze, produce, modify or respond to human texts and speech. Working with language technology often requires broa ...
– *
Information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
– *
International Conference on Language Resources and Evaluation The International Conference on Language Resources and Evaluation is an international conference organised by the European Language Resources Association every other year (on even years) with the support of institutions and organisations involved ...
– *
Kleene star In mathematical logic and computer science, the Kleene star (or Kleene operator or Kleene closure) is a unary operation, either on sets of strings or on sets of symbols or characters. In mathematics, it is more commonly known as the free monoid ...
– *
Language Computer Corporation Language Computer Corporation (LCC) is a natural language processing research company based in Richardson, Texas. The company develops a variety of natural language processing products, including software for question answering, information extr ...
– * Language model – *
LanguageWare LanguageWare is a natural language processing (NLP) technology developed by IBM, which allows applications to process natural language text. It comprises a set of Java libraries which provide a range of NLP functions: language identification, tex ...
– * Latent semantic mapping – *
Legal information retrieval Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works. Accurate legal information retrieval is important to provide access to the law to laymen and legal profe ...
– * Lesk algorithm – *
Lessac Technologies Lessac Technologies, Inc. (LTI) is an American firm which develops voice synthesis software, licenses technology and sells synthesized novels as MP3 files. The firm currently has seven patents granted and three more pending for its automated metho ...
– * Lexalytics – *
Lexical choice Lexical choice is the subtask of Natural language generation that involves choosing the content words (nouns, verbs, adjectives, and adverbs) in a generated text. Function words (determiners, for example) are usually chosen during realisation. Ex ...
– *
Lexical Markup Framework Language resource management - Lexical markup framework (LMF; ISO 24613:2008), is the International Organization for Standardization ISO/TC37 standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. The scope ...
– *
Lexical substitution Lexical substitution is the task of identifying a substitute for a word in the context of a clause. For instance, given the following text: "After the ''match'', replace any remaining fluid deficit to prevent chronic dehydration throughout the tour ...
– * LKB – * Logic form – *
LRE Map The LRE Map (Language Resources and Evaluation) is a freely accessible large database on resources dedicated to Natural language processing. The original feature of LRE Map is that the records are collected during the submission of different major N ...
– * Machine translation software usability – * MAREC – * Maximum entropy – *
Message Understanding Conference The Message Understanding Conferences (MUC) for computing and computer science, were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction. The c ...
– *
METEOR A meteoroid () is a small rocky or metallic body in outer space. Meteoroids are defined as objects significantly smaller than asteroids, ranging in size from grains to objects up to a meter wide. Objects smaller than this are classified as micr ...
– * Minimal recursion semantics – * Morphological pattern – *
Multi-document summarization Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickl ...
– * Multilingual notation – * Naive semantics – *
Natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...
– * Natural-language interface – * Natural-language user interface – * News analytics – * Nondeterministic polynomial – *
Open domain question answering Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural ...
– *
Optimality theory In linguistics, Optimality Theory (frequently abbreviated OT) is a linguistic model proposing that the observed forms of language arise from the optimal satisfaction of conflicting constraints. OT differs from other approaches to phonological ...
– *
Paco Nathan Paco Nathan (born 1962) is an American computer scientist and early engineer of the World Wide Web. Nathan is also an author and performance art show producer who established much of his career in Austin, Texas. Early life Paco Nathan was brought ...
– *
Phrase structure grammar The term phrase structure grammar was originally introduced by Noam Chomsky as the term for grammar studied previously by Emil Post and Axel Thue (Post canonical systems). Some authors, however, reserve the term for more restricted grammars in the ...
– *
Powerset (company) Powerset was an American company based in San Francisco, California, that, in 2006, was developing a natural language search engine for the Internet. On July 1, 2008, Powerset was acquired by Microsoft for an estimated $100 million. Powerset w ...
– *
Production (computer science) A production or production rule in computer science is a '' rewrite rule'' specifying a symbol substitution that can be recursively performed to generate new symbol sequences. A finite set of productions P is the main component in the specificati ...
– *
PropBank PropBank is a corpus that is annotated with verbal propositions and their arguments—a "proposition bank". Although "PropBank" refers to a specific corpus produced by Martha Palmer ''et al.'', the term ''propbank'' is also coming to be used a ...
– *
Question answering Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural l ...
– *
Realization (linguistics) In linguistics, realization is the process by which some kind of surface representation is derived from its underlying representation; that is, the way in which some abstract object of linguistic analysis comes to be produced in actual language. P ...
– *
Recursive transition network A recursive transition network ("RTN") is a graph theoretical schematic used to represent the rules of a context-free grammar. RTNs have application to programming languages, natural language and lexical analysis. Any sentence that is construct ...
– *
Referring expression generation Referring expression generation (REG) is the subtask of natural language generation (NLG) that received most scholarly attention. While NLG is concerned with the conversion of non-linguistic information into natural language, REG focuses only on the ...
– * Rewrite rule – *
Semantic compression In natural language processing, semantic compression is a process of compacting a lexicon used to build a textual document (or a set of documents) by reducing language heterogeneity, while maintaining text semantics. As a result, the same ideas ca ...
– *
Semantic neural network Semantic neural network (SNN) is based on John von Neumann's neural network von_Neumann,_1966.html" ;"title="/nowiki>von Neumann, 1966">/nowiki>von Neumann, 1966/nowiki> and Nikolai Amosov M-Network. There are limitations to a link topology for t ...
– *
SemEval SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. ...
– *
SPL notation SPL (Sentence Plan Language) is an abstract notation representing the semantics of a sentence in natural language. In a classical Natural Language Generation (NLG) workflow, an initial text plan (hierarchically or sequentially organized factoids, ...
– *
Stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morpholog ...
– reduces an inflected or derived word into its
word stem In linguistics, a word stem is a part of a word responsible for its lexical meaning. The term is used with slightly different meanings depending on the morphology of the language in question. In Athabaskan linguistics, for example, a verb stem is ...
, base, or
root In vascular plants, the roots are the organs of a plant that are modified to provide anchorage for the plant and take in water and nutrients into the plant body, which allows plants to grow taller and faster. They are most often below the sur ...
form. *
String kernel In machine learning and data mining, a string kernel is a Positive-definite kernel, kernel function that operates on String (computer science), strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intu ...


Natural-language processing tools

*
Google Ngram Viewer The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in printed sources published between 1500 and 2019 in Google's text cor ...
– graphs ''n''-gram usage from a corpus of more than 5.2 million books


Corpora

*
Text corpus In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical a ...
(see
list A ''list'' is any set of items in a row. List or lists may also refer to: People * List (surname) Organizations * List College, an undergraduate division of the Jewish Theological Seminary of America * SC Germania List, German rugby union ...
) – large and structured set of texts (nowadays usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. **
Bank of English The Bank of English is a representative subset of the 4.5 billion words COBUILD corpus, a collection of English texts. These are mainly British in origin, but content from North America, Australia, New Zealand, South Africa and other Commonwealth ...
**
British National Corpus The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention ...
**
Corpus of Contemporary American English The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU). Content The Corpus of C ...
(COCA) **
Oxford English Corpus The Oxford English Corpus (OEC) is a text corpus of 21st-century English, used by the makers of the ''Oxford English Dictionary'' and by Oxford University Press' language research programme. It is the largest corpus of its kind, containing nearly ...


Natural-language processing toolkits

The following natural-language processing toolkits are notable collections of
natural-language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
software. They are suites of
libraries A library is a collection of materials, books or media that are accessible for use and not just for display purposes. A library provides physical (hard copies) or digital access (soft copies) materials, and may be a physical location or a vir ...
,
frameworks A framework is a generic term commonly referring to an essential supporting structure which other things are built on top of. Framework may refer to: Computing * Application framework, used to implement the structure of an application for an op ...
, and
applications Application may refer to: Mathematics and computing * Application software, computer software designed to help the user to perform specific tasks ** Application layer, an abstraction layer that specifies protocols and interface methods used in a c ...
for symbolic, statistical natural-language and speech processing.


Named-entity recognizers

* ABNER (A Biomedical Named-Entity Recognizer) – open source text mining program that uses linear-chain conditional random field sequence models. It automatically tags genes, proteins and other entity names in text. Written by Burr Settles of the University of Wisconsin-Madison. * Stanford NER (Named-Entity Recognizer) — Java implementation of a Named-Entity Recognizer that uses linear-chain conditional random field sequence models. It automatically tags persons, organizations, and locations in text in English, German, Chinese, and Spanish languages. Written by Jenny Finkel and other members of the Stanford NLP Group at Stanford University.


Translation software

*
Comparison of machine translation applications Machine translation is an algorithm which attempts to translate text or speech from one natural language to another. General information Basic general information for popular machine translation applications. Languages features compariso ...
* Machine translation applications **
Google Translate Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another. It offers a website interface, a mobile app for Android and iOS, and an API t ...
** DeepL **
Linguee Linguee is an online bilingual concordance that provides an online dictionary for a number of language pairs, including many bilingual sentence pairs. As a translation aid, Linguee differs from machine translation services like Babel Fish and is ...
– web service that provides an online dictionary for a number of language pairs. Unlike similar services, such as LEO, Linguee incorporates a search engine that provides access to large amounts of bilingual, translated sentence pairs, which come from the World Wide Web. As a translation aid, Linguee therefore differs from machine translation services like Babelfish and is more similar in function to a translation memory. ** UNL Universal Networking Language **
Yahoo! Babel Fish Yahoo! Babel Fish was a free Web-based multilingual translation application. In May 2012 it was replaced by Bing Translator (now Microsoft Translator), to which queries were redirected. Although Yahoo! has transitioned its Babel Fish translation s ...
** Reverso


Other software

*
CTAKES Apache cTAKES: clinical Text Analysis and Knowledge Extraction System is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health record unstructured text. It processes clinical notes, id ...
– open-source natural-language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities — drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context (family history of, current, unrelated to patient), and negated/not negated. Also known as Apache cTAKES. * DMAP – *
ETAP-3 ETAP-3 is a proprietary linguistic processing system focusing on English language, English and Russian language, Russian. It was developed in Moscow, Russia at the Institute for Information Transmission Problems (:ru:Институт проблем ...
– proprietary linguistic processing system focusing on English and Russian. It is a
rule-based system In computer science, a rule-based system is used to store and manipulate knowledge to interpret information in a useful way. It is often used in artificial intelligence applications and research. Normally, the term ''rule-based system'' is appli ...
which uses the Meaning-Text Theory as its theoretical foundation. *
JAPE Jape is a synonym for a practical joke. Jape or JAPE may also refer to: * Jape (band), an Irish electronic/rock band * JAPE (linguistics), a transformation language widely used in natural language processing * JAPE, an automated pun generator * J ...
– the Java Annotation Patterns Engine, a component of the open-source General Architecture for Text Engineering (GATE) platform. JAPE is a finite state transducer that operates over annotations based on regular expressions. *
LOLITA ''Lolita'' is a 1955 novel written by Russian-American novelist Vladimir Nabokov. The novel is notable for its controversial subject: the protagonist and unreliable narrator, a middle-aged literature professor under the pseudonym Humbert Humber ...
– "Large-scale, Object-based, Linguistic Interactor, Translator and Analyzer". LOLITA was developed by Roberto Garigliano and colleagues between 1986 and 2000. It was designed as a general-purpose tool for processing unrestricted text that could be the basis of a wide variety of applications. At its core was a semantic network containing some 90,000 interlinked concepts. * Maluuba – intelligent personal assistant for Android devices, that uses a contextual approach to search which takes into account the user's geographic location, contacts, and language. * METAL MT – machine translation system developed in the 1980s at the University of Texas and at Siemens which ran on Lisp Machines. *
Never-Ending Language Learning Never-Ending Language Learning system (NELL) is a semantic machine learning system developed by a research team at Carnegie Mellon University, and supported by grants from DARPA, Google, NSF, and CNPq with portions of the system running on a superc ...
– semantic machine learning system developed by a research team at Carnegie Mellon University, and supported by grants from DARPA, Google, and the NSF, with portions of the system running on a supercomputing cluster provided by Yahoo!. NELL was programmed by its developers to be able to identify a basic set of fundamental semantic relationships between a few hundred predefined categories of data, such as cities, companies, emotions and sports teams. Since the beginning of 2010, the Carnegie Mellon research team has been running NELL around the clock, sifting through hundreds of millions of web pages looking for connections between the information it already knows and what it finds through its search process – to make new connections in a manner that is intended to mimic the way humans learn new information. *
NLTK The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and E ...
– * Online-translator.com – * Regulus Grammar Compiler – software system for compiling unification grammars into grammars for speech recognition systems. * S Voice – *
Siri (software) Siri ( ) is a virtual assistant that is part of Apple Inc.'s iOS, iPadOS, watchOS, macOS, tvOS, and audioOS operating systems. It uses voice queries, gesture based control, focus-tracking and a natural-language user interface to answer quest ...
– * Speaktoit – * TeLQAS – * Weka's classification tools – *
word2vec Word2vec is a technique for natural language processing (NLP) published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or ...
– models that were developed by a team of researchers led by Thomas Milkov at Google to generate word embeddings that can reconstruct some of the linguistic context of words using shallow, two dimensional neural nets derived from a much larger vector space. *
Festival Speech Synthesis System The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black, Paul Taylor and Richard Caley at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Subst ...
– *
CMU Sphinx CMU Sphinx, also called Sphinx for short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers (Sphinx 2 - 4) and an acoustic model traine ...
speech recognition system – * Language Grid – Open source platform for language web services, which can customize language services by combining existing language services.


Chatterbots

Chatterbot A chatbot or chatterbot is a software application used to conduct an on-line chat conversation via text or text-to-speech, in lieu of providing direct contact with a live human agent. Designed to convincingly simulate the way a human would beha ...
– a text-based conversation
agent Agent may refer to: Espionage, investigation, and law *, spies or intelligence officers * Law of agency, laws involving a person authorized to act on behalf of another ** Agent of record, a person with a contractual agreement with an insuranc ...
that can interact with human users through some medium, such as an
instant message Instant messaging (IM) technology is a type of online chat allowing real-time text transmission over the Internet or another computer network. Messages are typically transmitted between two or more parties, when each user inputs text and trigge ...
service. Some chatterbots are designed for specific purposes, while others converse with human users on a wide range of topics.


Classic chatterbots

*
Dr. Sbaitso Dr. Sbaitso is an artificial intelligence speech synthesis program released late in 1991 by Creative Labs in Singapore for MS-DOS-based personal computers. The name is an acronym for "SoundBlaster Acting Intelligent Text-to-Speech Operator." ...
*
ELIZA ELIZA is an early natural language processing computer program created from 1964 to 1966 at the MIT Artificial Intelligence Laboratory by Joseph Weizenbaum. Created to demonstrate the superficiality of communication between humans and machines, E ...
*
PARRY PARRY was an early example of a chatbot, implemented in 1972 by psychiatrist Kenneth Colby. History PARRY was written in 1972 by psychiatrist Kenneth Colby, then at Stanford University. While ELIZA was a tongue-in-cheek simulation of a Rogeria ...
*
Racter ''Racter'' is an artificial intelligence computer program that generates English language prose at random. It was published in 1984 by Mindscape. History Racter, short for ''raconteur'', was written by William Chamberlain and Thomas Etter. The e ...
(or Claude Chatterbot) *
Mark V Shaney Mark V. Shaney is a synthetic Usenet user whose postings in the ''net.singles'' newsgroups were generated by Markov chain techniques, based on text from other postings. The username is a play on the words "Markov chain". Many readers were foole ...


General chatterbots

* Albert One – 1998 and 1999 Loebner winner, by Robby Garner. *
A.L.I.C.E. A.L.I.C.E. (Artificial Linguistic Internet Computer Entity), also referred to as Alicebot, or simply Alice, is a natural language processing chatterbot—a program that engages in a conversation with a human by applying some heuristical pattern ...
– 2001, 2002, and 2004
Loebner Prize The Loebner Prize was an annual competition in artificial intelligence that awards prizes to the computer programs considered by the judges to be the most human-like. The prize is reported as defunct since 2020. The format of the competition was tha ...
winner developed by Richard Wallace. * Charlix *
Cleverbot Cleverbot is a chatterbot web application that uses machine learning techniques to have conversations with humans. It was created by British AI scientist Rollo Carpenter. It was preceded by Jabberwacky, a chatbot project that began in 1988 and ...
(winner of the 2010 Mechanical Intelligence Competition) * Elbot – 2008
Loebner Prize The Loebner Prize was an annual competition in artificial intelligence that awards prizes to the computer programs considered by the judges to be the most human-like. The prize is reported as defunct since 2020. The format of the competition was tha ...
winner, by
Fred Roberts Frederick Clark Roberts (born August 14, 1960) is an American former basketball player who played power forward in the National Basketball Association (NBA) for 13 seasons, a career spanning from 1983 to 1997, becoming a successful journeymen in ...
. *
Eugene Goostman Eugene Goostman is a chatbot that some regard as having passed the Turing test, a test of a computer's ability to communicate indistinguishably from a human. Developed in Saint Petersburg in 2001 by a group of three programmers, the Russian-born ...
– 2012 Turing 100 winner, by Vladimir Veselov. *
Fred Fred may refer to: People * Fred (name), including a list of people and characters with the name Mononym * Fred (cartoonist) (1931–2013), pen name of Fred Othon Aristidès, French * Fred (footballer, born 1949) (1949–2022), Frederico Ro ...
– an early chatterbot by Robby Garner. *
Jabberwacky Jabberwacky is a chatterbot created by British programmer Rollo Carpenter. Its stated aim is to "simulate natural human chat in an interesting, entertaining and humorous manner". It is an early attempt at creating an artificial intelligence th ...
*
Jeeney AI Jeeney AI is a natural language processing chatterbot. History Jeeny AI was named "Best Overall Bot" in the 2009 Chatterbox Challenge, after ranking seventh, but being the "Best New Entry" in the previous year.MegaHAL MegaHAL is a computer conversation simulator, or "chatterbot", created by Jason Hutchens. Background In 1996, Jason Hutchens entered the Loebner Prize Contest witHeX a chatterbot based on ELIZA. HeX won the competition that year and took the $20 ...
*
Mitsuku Kuki is an embodied AI bot designed to befriend humans in the metaverse. Formerly known as Mitsuku, Kuki is a chatbot created from Pandorabots AIML technology by Steve Worswick. It is a five-time winner of a Turing Test competition called the ...
, 2013 and 2016
Loebner Prize The Loebner Prize was an annual competition in artificial intelligence that awards prizes to the computer programs considered by the judges to be the most human-like. The prize is reported as defunct since 2020. The format of the competition was tha ...
winner * Rose - ... 2015 - 3x
Loebner Prize The Loebner Prize was an annual competition in artificial intelligence that awards prizes to the computer programs considered by the judges to be the most human-like. The prize is reported as defunct since 2020. The format of the competition was tha ...
winner, by
Bruce Wilcox Bruce Wilcox (born 19) is an artificial intelligence programmer. Work MTS/LISP and Computer Go A graduate of Michigan, Wilcox wrote the MTS/LISP interpreter (the LISP system used at the University of Michigan and a consortium of other places inc ...
. *
SimSimi SimSimi is an artificial intelligence conversation program created in 2002 by ISMaker. It grows its artificial intelligence day by day assisted by a feature that allows users to teach it to respond correctly. SimSimi, pronounced as "shim-shimi", ...
– A popular artificial intelligence conversation program that was created in 2002 by ISMaker. * Spookitalk – A chatterbot used for
NPCs A non-player character (NPC), or non-playable character, is any character in a game that is not controlled by a player. The term originated in traditional tabletop role-playing games where it applies to characters controlled by the gamemaster o ...
in
Douglas Adams Douglas Noel Adams (11 March 1952 – 11 May 2001) was an English author and screenwriter, best known for ''The Hitchhiker's Guide to the Galaxy''. Originally a 1978 BBC radio comedy, ''The Hitchhiker's Guide to the Galaxy'' developed into a " ...
' ''
Starship Titanic ''Starship Titanic'' is an adventure game developed by The Digital Village and published by Simon & Schuster Interactive. It was released in April 1998 for Microsoft Windows and in March 1999 for Apple Macintosh. The game takes place on the ep ...
'' video game. * Ultra Hal – 2007
Loebner Prize The Loebner Prize was an annual competition in artificial intelligence that awards prizes to the computer programs considered by the judges to be the most human-like. The prize is reported as defunct since 2020. The format of the competition was tha ...
winner, by Robert Medeksza. *
Verbot The Verbot (Verbal-Robot) was a popular chatterbot program and Artificial Intelligence Software Development Kit (SDK) for Windows and the web. Early beginning virtual personalities, Inc. traces its technology back to Michael Mauldin's work a ...


Instant messenger chatterbots

* GooglyMinotaur, specializing in
Radiohead Radiohead are an English rock band formed in Abingdon, Oxfordshire, in 1985. The band consists of Thom Yorke (vocals, guitar, piano, keyboards); brothers Jonny Greenwood (lead guitar, keyboards, other instruments) and Colin Greenwood (bass) ...
, the first bot released by
ActiveBuddy Colloquis, previously known as ActiveBuddy and Conversagent, was a company that created conversation-based interactive agents originally distributed via instant messaging platforms. The company had offices in New York, NY and Sunnyvale, CA. Found ...
(June 2001-March 2002) *
SmarterChild SmarterChild was a chatbot available on AOL Instant Messenger and Windows Live Messenger (previously MSN Messenger) networks. History SmarterChild was an intelligent agent or "bot" developed by ActiveBuddy, Inc., with offices in New York and Su ...
, developed by
ActiveBuddy Colloquis, previously known as ActiveBuddy and Conversagent, was a company that created conversation-based interactive agents originally distributed via instant messaging platforms. The company had offices in New York, NY and Sunnyvale, CA. Found ...
and released in June 2001 * Infobot, an assistant on
IRC Internet Relay Chat (IRC) is a text-based chat system for instant messaging. IRC is designed for group communication in discussion forums, called '' channels'', but also allows one-on-one communication via private messages as well as chat an ...
channels such as ''#perl'', primarily to help out with answering Frequently Asked Questions (June 1995-today) * Negobot, a bot designed to catch online pedophiles by posing as a young girl and attempting to elicit personal details from people it speaks to.


Natural-language processing organizations

* AFNLP (Asian Federation of Natural Language Processing Associations) – the organization for coordinating the natural-language processing related activities and events in the Asia-Pacific region. *
Australasian Language Technology Association The Australasian Language Technology Association (ALTA) promotes language technology research and development in Australia and New Zealand. ALTA organises regular events for the exchange of research results and for academic and industrial training ...
– *
Association for Computational Linguistics The Association for Computational Linguistics (ACL) is a scientific and professional organization for people working on natural language processing. Its namesake conference is one of the primary high impact conferences for natural language proces ...
– international scientific and professional society for people working on problems involving natural-language processing.


Natural-language processing-related conferences

*
Annual Meeting of the Association for Computational Linguistics Annual may refer to: * Annual publication, periodical publications appearing regularly once per year **Yearbook ** Literary annual * Annual plant * Annual report * Annual giving * Annual, Morocco, a settlement in northeastern Morocco * Annuals (b ...
(ACL) *
International Conference on Intelligent Text Processing and Computational Linguistics CICLing (International Conference on Computational Linguistics and Intelligent Text Processing; before 2017 known under the name International Conference on Intelligent Text Processing and Computational Linguistics) is an annual conference on c ...
(CICLing) *
International Conference on Language Resources and Evaluation The International Conference on Language Resources and Evaluation is an international conference organised by the European Language Resources Association every other year (on even years) with the support of institutions and organisations involved ...
– biennial conference organised by the European Language Resources Association with the support of institutions and organisations involved in natural-language processing * Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) * Text, Speech and Dialogue (TSD) – annual conference *
Text Retrieval Conference The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or ''tracks.'' It is co-sponsored by the National Institute of Standards and Technology (NIST) an ...
(TREC) – on-going series of workshops focusing on various information retrieval (IR) research areas, or tracks


Companies involved in natural-language processing

*
AlchemyAPI AlchemyAPI was a software company in the field of machine learning. Its technology employed deep learning for various applications in natural language processing, such as semantic text analysis and sentiment analysis, as well as computer vision ...
– service provider of a natural-language processing API. * Google, Inc. – the Google search engine is an example of automatic summarization, utilizing keyphrase extraction. * Calais (Reuters product) – provider of a natural-language processing services. *
Wolfram Research, Inc. Wolfram Research, Inc. ( ) is an American multinational company that creates computational technology. Wolfram's flagship product is the technical computing program Wolfram Mathematica, first released on June 23, 1988. Other products include W ...
developer of natural-language processing computation engine
Wolfram Alpha WolframAlpha ( ) is an answer engine developed by Wolfram Research. It answers factual queries by computing answers from externally sourced data. WolframAlpha was released on May 18, 2009 and is based on Wolfram's earlier product Wolfram Mat ...
.


Natural-language processing publications


Books

*
Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing
' – Wermter, S., Riloff E. and Scheler, G. (editors). First book that addressed statistical and neural network learning of language. *

' – by
Daniel Jurafsky Daniel Jurafsky is a professor of linguistics and computer science at Stanford University, and also an author. With Daniel Gildea, he is known for developing the first automatic system for semantic role labeling (SRL). He is the author of ''The ...
and James H. Martin. Introductory book on language technology.


Book series

* ''
Studies in Natural Language Processing ''Studies in Natural Language Processing'' is the book series of the Association for Computational Linguistics, published by Cambridge University PressSteven Birdis the series editor. The editorial board has the following membersChu-Ren Huang Chair ...
'' – book series of the Association for Computational Linguistics, published by Cambridge University Press.


Journals

* ''
Computational Linguistics Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...
'' – peer-reviewed academic journal in the field of computational linguistics. It is published quarterly by MIT Press for the Association for Computational Linguistics (ACL)


People influential in natural-language processing

* Daniel Bobrow – *
Rollo Carpenter Rollo Carpenter (born 1965) is the British-born creator of Jabberwacky and Cleverbot, learning Artificial Intelligence (AI) software. Carpenter worked as CTO of a business software startup in Silicon Valley. Career His brother is the artist Me ...
– creator of Jabberwacky and Cleverbot. *
Noam Chomsky Avram Noam Chomsky (born December 7, 1928) is an American public intellectual: a linguist, philosopher, cognitive scientist, historian, social critic, and political activist. Sometimes called "the father of modern linguistics", Chomsky is ...
– author of the seminal work ''
Syntactic Structures ''Syntactic Structures'' is an influential work in linguistics by American linguist Noam Chomsky, originally published in 1957. It is an elaboration of his teacher Zellig Harris's model of transformational generative grammar. A short monograph ...
'', which revolutionized Linguistics with '
universal grammar Universal grammar (UG), in modern linguistics, is the theory of the genetic component of the language faculty, usually credited to Noam Chomsky. The basic postulate of UG is that there are innate constraints on what the grammar of a possible hum ...
', a rule based system of syntactic structures. *
Kenneth Colby Kenneth Mark Colby (1920 – April 20, 2001) was an American psychiatrist dedicated to the theory and application of computer science and artificial intelligence to psychiatry. Colby was a pioneer in the development of computer technology as ...
– * David Ferrucci – principal investigator of the team that created Watson, IBM's AI computer that won the quiz show ''Jeopardy!'' *
Lyn Frazier Lyn Frazier (born October 15, 1952, in Madison, Wisconsin) is an experimental linguist, focusing on Psycholinguistics, psycholinguistic research of adult Sentence Comprehension, sentence comprehension. She is professor emerita at the University of ...
– *
Daniel Jurafsky Daniel Jurafsky is a professor of linguistics and computer science at Stanford University, and also an author. With Daniel Gildea, he is known for developing the first automatic system for semantic role labeling (SRL). He is the author of ''The ...
– Professor of Linguistics and Computer Science at Stanford University. With James H. Martin, he wrote the textbook ''Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics'' *
Roger Schank Roger Carl Schank (born 1946) is an American artificial intelligence theorist, cognitive psychologist, learning scientist, educational reformer, and entrepreneur. Beginning in the late 1960s, he pioneered conceptual dependency theory (within th ...
– introduced the
conceptual dependency theory Conceptual dependency theory is a model of natural language understanding used in artificial intelligence systems. Roger Schank at Stanford University introduced the model in 1969, in the early days of artificial intelligence. This model was exten ...
for natural-language understanding. * Jean E. Fox Tree – *
Alan Turing Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist. Turing was highly influential in the development of theoretical com ...
– originator of the
Turing Test The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to artificial intelligence, exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing propos ...
. *
Joseph Weizenbaum Joseph Weizenbaum (8 January 1923 – 5 March 2008) was a German American computer scientist and a professor at MIT. The Weizenbaum Award is named after him. He is considered one of the fathers of modern artificial intelligence. Life and caree ...
– author of the
ELIZA ELIZA is an early natural language processing computer program created from 1964 to 1966 at the MIT Artificial Intelligence Laboratory by Joseph Weizenbaum. Created to demonstrate the superficiality of communication between humans and machines, E ...
chatterbot A chatbot or chatterbot is a software application used to conduct an on-line chat conversation via text or text-to-speech, in lieu of providing direct contact with a live human agent. Designed to convincingly simulate the way a human would beha ...
. *
Terry Winograd Terry Allen Winograd (born February 24, 1946) is an American professor of computer science at Stanford University, and co-director of the Stanford Human–Computer Interaction Group. He is known within the philosophy of mind and artificial intell ...
– professor of computer science at Stanford University, and co-director of the Stanford Human-Computer Interaction Group. He is known within the philosophy of mind and artificial intelligence fields for his work on natural language using the SHRDLU program. * William Aaron Woods – * Maurice Gross – author of the concept of local grammar,Ibrahim, Amr Helmy. 2002. "Maurice Gross (1934-2001). À la mémoire de Maurice Gross". ''Hermès'' 34.
/ref> taking finite automata as the competence model of language.Dougherty, Ray. 2001. ''Maurice Gross Memorial Letter''.
/ref> *
Stephen Wolfram Stephen Wolfram (; born 29 August 1959) is a British-American computer scientist, physicist, and businessman. He is known for his work in computer science, mathematics, and theoretical physics. In 2012, he was named a fellow of the American Ma ...
– CEO and founder of
Wolfram Research Wolfram Research, Inc. ( ) is an American multinational company that creates computational technology. Wolfram's flagship product is the technical computing program Wolfram Mathematica, first released on June 23, 1988. Other products include ...
, creator of the programming language (natural-language understanding)
Wolfram Language The Wolfram Language ( ) is a general multi-paradigm programming language developed by Wolfram Research. It emphasizes symbolic computation, functional programming, and rule-based programming and can employ arbitrary structures and data. It ...
, and natural-language processing computation engine
Wolfram Alpha WolframAlpha ( ) is an answer engine developed by Wolfram Research. It answers factual queries by computing answers from externally sourced data. WolframAlpha was released on May 18, 2009 and is based on Wolfram's earlier product Wolfram Mat ...
. *
Victor Yngve Victor H. Yngve (July 5, 1920 – January 15, 2012W. John HutchinVictor Yngve obituary aclweb.org; accessed August 15, 2017.) was professor of linguistics at the University of Chicago and the Massachusetts Institute of Technology (1953-1965). H ...


See also


References


Bibliography

* * . * .


External links

{{Outline footer *
Natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...
Natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...