The following outline is provided as an overview of and topical guide to natural-language processing:

natural-language processing – computer activity in which computers are used to analyze, understand, alter, or generate natural language. This includes the automation of any or all linguistic forms, activities, or methods of communication, such as conversation, correspondence, reading, written composition, dictation, publishing, translation, lip reading, and so on. Natural-language processing is also the name of the branch of computer science, artificial intelligence, and linguistics concerned with enabling computers to engage in communication using natural language(s) in all forms, including but not limited to speech, print, writing, and signing.

Natural-language processing

Natural-language processing can be described as all of the following:
* A field of science – systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
** An applied science – field that applies human knowledge to build or design useful things.
*** A field of computer science – scientific and practical approach to computation and its applications.
**** A branch of artificial intelligence – intelligence of machines and robots and the branch of computer science that aims to create it.
**** A subfield of computational linguistics – interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective.
** An application of engineering – science, skill, and profession of acquiring and applying scientific, economic, social, and practical knowledge in order to design and build structures, machines, devices, systems, materials and processes.
*** An application of software engineering – application of a systematic, disciplined, quantifiable approach to the design, development, operation, and maintenance of software, and the study of these approaches; that is, the application of engineering to software (SWEBOK).
**** A subfield of computer programming – process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages (such as Java, C++, C#, Python, etc.). The purpose of programming is to create a set of instructions that computers use to perform specific operations or to exhibit desired behaviors.
***** A subfield of artificial intelligence programming –
* A type of system – set of interacting or interdependent components forming an integrated whole, or a set of elements (often called 'components') and relationships which are different from relationships of the set or its elements to other elements or sets.
** A system that includes software – collection of computer programs and related data that provides the instructions for telling a computer what to do and how to do it. Software refers to one or more computer programs and data held in the storage of the computer. In other words, software is a set of programs, procedures, algorithms and their documentation concerned with the operation of a data processing system.
* A type of technology – making, modification, usage, and knowledge of tools, machines, techniques, crafts, systems, and methods of organization in order to solve a problem, improve a preexisting solution to a problem, achieve a goal, handle an applied input/output relation or perform a specific function. It can also refer to the collection of such tools, machinery, modifications, arrangements and procedures. Technologies significantly affect the ability of humans, as well as other animal species, to control and adapt to their natural environments.
** A form of computer technology – computers and their application. NLP makes use of computers, image scanners, microphones, and many types of software programs.
*** Language technology – consists of natural-language processing (NLP) and computational linguistics (CL) on the one hand, and speech technology on the other. It also includes many application-oriented aspects of these. It is often called human language technology (HLT).

Prerequisite technologies

The following technologies make natural-language processing possible:
* Communication – the activity of a source sending a message to a receiver
** Language –
*** Speech –
*** Writing –
** Computing –
*** Computers –
*** Computer programming –
**** Information extraction –
**** User interface –
*** Software –
**** Text editing – program used to edit plain text files
**** Word processing – piece of software used for composing, editing, formatting, printing documents
*** Input devices – pieces of hardware for sending data to a computer to be processed
**** Computer keyboard – typewriter style input device whose input is converted into various data depending on the circumstances
**** Image scanners –

Subfields of natural-language processing

* Information extraction (IE) – field concerned in general with the extraction of semantic information from text. This covers tasks such as named-entity recognition, coreference resolution, relationship extraction, etc.
* Ontology engineering – field that studies the methods and methodologies for building ontologies, which are formal representations of a set of concepts within a domain and the relationships between those concepts.
* Speech processing – field that covers speech recognition, text-to-speech and related tasks.
* Statistical natural-language processing –
** Statistical semantics – a subfield of computational semantics that establishes semantic relations between words to examine their contexts.
*** Distributional semantics – a subfield of statistical semantics that examines the semantic relationships of words across corpora or in large samples of data.
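The distributional semantics entry above can be made concrete with a minimal sketch. It assumes a tiny hand-written corpus (`TOY_CORPUS`), whitespace tokenization, and a context window of two tokens. These, along with the helper names `cooccurrence_vectors` and `cosine`, are illustrative choices and do not come from any particular library. Each word is represented by counts of its neighbouring words, and two words are compared by the cosine of their count vectors.

```python
from collections import Counter
import math

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]  # neighbours, excluding the word itself
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, made up for illustration only.
TOY_CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell sharply on monday",
]

vectors = cooccurrence_vectors(TOY_CORPUS)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])        # share many contexts
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])  # share no contexts
```

On this toy data "cat" and "dog" come out far more similar than "cat" and "stocks", because they occur in similar contexts. Practical systems typically use much larger corpora and reweight or compress the raw counts.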

Related fields

Natural-language processing contributes to, and makes use of (the theories, tools, and methodologies from), the following fields: *

Automated reasoning In computer science, in particular in knowledge representation and reasoning and metalogic, the area of automated reasoning is dedicated to understanding different aspects of reasoning. The study of automated reasoning helps produce computer progr ...

– area of computer science and mathematical logic dedicated to understanding various aspects of reasoning, and producing software which allows computers to reason completely, or nearly completely, automatically. A sub-field of artificial intelligence, automatic reasoning is also grounded in theoretical computer science and philosophy of mind. *

Linguistics Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Lingu ...

– scientific study of human language. Natural-language processing requires understanding of the structure and application of language, and therefore it draws heavily from linguistics. **

Applied linguistics Applied linguistics is an interdisciplinary field which identifies, investigates, and offers solutions to language-related real-life problems. Some of the academic fields related to applied linguistics are education, psychology, communication res ...

– interdisciplinary field of study that identifies, investigates, and offers solutions to language-related real-life problems. Some of the academic fields related to applied linguistics are education, linguistics, psychology, computer science, anthropology, and sociology. Some of the subfields of applied linguistics relevant to natural-language processing are: *** Bilingualism / Multilingualism – ***

Computer-mediated communication Computer-mediated communication (CMC) is defined as any human communication that occurs through the use of two or more electronic devices. While the term has traditionally referred to those communications that occur via computer-mediated forma ...

(CMC) – any communicative transaction that occurs through the use of two or more networked computers. Research on CMC focuses largely on the social effects of different computer-supported communication technologies. Many recent studies involve Internet-based

social networking A social network is a social structure made up of a set of social actors (such as individuals or organizations), sets of dyadic ties, and other social interactions between actors. The social network perspective provides a set of methods for a ...

supported by

social software Social software, also known as social apps or social platform, include communications and interactive tools that are often based on the Internet. Communication tools typically handle the capturing, storing and presentation of communication, usua ...

. ***

Contrastive linguistics Contrastive linguistics is a practice-oriented linguistic approach that seeks to describe the differences and similarities between a pair of languages (hence it is occasionally called "''differential'' linguistics"). History While traditional ...

– practice-oriented linguistic approach that seeks to describe the differences and similarities between a pair of languages. ***

Conversation analysis Conversation analysis (CA) is an approach to the study of social interaction, embracing both verbal and non-verbal conduct, in situations of everyday life. CA originated as a sociological method, but has since spread to other fields. CA began wit ...

(CA) – approach to the study of social interaction, embracing both verbal and non-verbal conduct, in situations of everyday life.

Turn-taking Turn-taking is a type of organization in conversation and discourse where participants speak one at a time in alternating turns. In practice, it involves processes for constructing contributions, responding to previous comments, and transitioning ...

is one aspect of language use that is studied by CA. ***

Discourse analysis Discourse analysis (DA), or discourse studies, is an approach to the analysis of written, vocal, or sign language use, or any significant semiotic event. The objects of discourse Analysis ( discourse, writing, conversation, communicative even ...

– various approaches to analyzing written, vocal, or sign language use or any significant semiotic event. *** Forensic linguistics – application of linguistic knowledge, methods and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. ***

Interlinguistics Interlinguistics, as the science of planned languages, has existed for more than a century as a specific branch of linguistics for the study of various aspects of linguistic communication. Interlinguistics is a discipline formalized by Otto Jespers ...

– study of improving communications between people of different first languages with the use of ethnic and auxiliary languages (lingua franca). For instance by use of intentional international auxiliary languages, such as Esperanto or Interlingua, or spontaneous interlanguages known as pidgin languages. ***

Language assessment Language assessment or language testing is a field of study under the umbrella of applied linguistics. Its main focus is the assessment of first, second or other language in the school, college, or university context; assessment of language use ...

– assessment of first, second or other language in the school, college, or university context; assessment of language use in the workplace; and assessment of language in the immigration, citizenship, and asylum contexts. The assessment may include analyses of listening, speaking, reading, writing or cultural understanding, with respect to understanding how the language works theoretically and the ability to use the language practically. ***

Language pedagogy Language pedagogy is the discipline concerned with the theories and techniques of teaching language. It has been described as a type of teaching wherein the teacher draws from his prior knowledge and actual experience in teaching language. The appr ...

– science and art of language education, including approaches and methods of language teaching and study. Natural-language processing is used in programs designed to teach language, including first- and second-language training. ***

Language planning In sociolinguistics, language planning (also known as language engineering) is a deliberate effort to influence the function, structure or acquisition of languages or language varieties within a speech community.Kaplan B., Robert, and Richard ...

– ***

Language policy Language policy is an interdisciplinary academic field. Some scholars such as Joshua Fishman and Ofelia García consider it as part of sociolinguistics. On the other hand, other scholars such as Bernard SpolskyRobert B. Kaplanand Joseph Lo Bianc ...

– ***

Lexicography Lexicography is the study of lexicons, and is divided into two separate academic disciplines. It is the art of compiling dictionaries. * Practical lexicography is the art or craft of compiling, writing and editing dictionaries. * Theoret ...

– *** Literacies – ***

Pragmatics In linguistics and related fields, pragmatics is the study of how context contributes to meaning. The field of study evaluates how human language is utilized in social interactions, as well as the relationship between the interpreter and the in ...

– ***

Second-language acquisition Second-language acquisition (SLA), sometimes called second-language learning — otherwise referred to as L2 (language 2) acquisition, is the process by which people learn a second language. Second-language acquisition is also the scientific dis ...

– ***

Stylistics Stylistics, a branch of applied linguistics, is the study and interpretation of texts of all types and/or spoken language in regard to their linguistic and tonal style, where style is the particular variety of language used by different individu ...

– ***

Translation Translation is the communication of the Meaning (linguistic), meaning of a #Source and target languages, source-language text by means of an Dynamic and formal equivalence, equivalent #Source and target languages, target-language text. The ...

– **

Computational linguistics Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...

– interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective. The models and tools of computational linguistics are used extensively in the field of natural-language processing, and vice versa. ***

Computational semantics Computational semantics is the study of how to automate the process of constructing and reasoning with meaning representations of natural language expressions. It consequently plays an important role in natural-language processing and computati ...

– ***

Corpus linguistics Corpus linguistics is the study of a language as that language is expressed in its text corpus (plural ''corpora''), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora ...

– study of language as expressed in samples ''(corpora)'' of "real world" text. ''Corpora'' is the plural of ''corpus'', and a corpus is a specifically selected collection of texts (or speech segments) composed of natural language. After it is constructed (gathered or composed), a corpus is analyzed with the methods of computational linguistics to infer the meaning and context of its components (words, phrases, and sentences), and the relationships between them. Optionally, a corpus can be annotated ("tagged") with data (manually or automatically) to make the corpus easier to understand (e.g.,

part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definiti ...

). This data is then applied to make sense of user input, for example, to make better (automated) guesses of what people are talking about or saying, perhaps to achieve more narrowly focused web searches, or for speech recognition. **

Metalinguistics Metalinguistics is the branch of linguistics that studies language and its relationship to other cultural behaviors. It is the study of dialogue relationships between units of speech communication as manifestations and enactments of co-existen ...

– ** Sign linguistics – scientific study and analysis of natural sign languages, their features, their structure (phonology, morphology, syntax, and semantics), their acquisition (as a primary or secondary language), how they develop independently of other languages, their application in communication, their relationships to other languages (including spoken languages), and many other aspects. *

Human–computer interaction Human–computer interaction (HCI) is research in the design and the use of computer technology, which focuses on the interfaces between people ( users) and computers. HCI researchers observe the ways humans interact with computers and design ...

– the intersection of computer science and behavioral sciences, this field involves the study, planning, and design of the interaction between people (users) and computers. Attention to human-machine interaction is important, because poorly designed human-machine interfaces can lead to many unexpected problems. A classic example of this is the

Three Mile Island accident The Three Mile Island accident was a partial meltdown of the Three Mile Island, Unit 2 (TMI-2) reactor in Pennsylvania, United States. It began at 4 a.m. on March 28, 1979. It is the most significant accident in U.S. commercial nuclea ...

where investigations concluded that the design of the human–machine interface was at least partially responsible for the disaster. * Information retrieval (IR) – field concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP. *

Knowledge representation Knowledge representation and reasoning (KRR, KR&R, KR²) is the field of artificial intelligence (AI) dedicated to representing information about the world in a form that a computer system can use to solve complex tasks such as diagnosing a medic ...

(KR) – area of artificial intelligence research aimed at representing knowledge in symbols to facilitate inferencing from those knowledge elements, creating new elements of knowledge. Knowledge Representation research involves analysis of how to reason accurately and effectively and how best to use a set of symbols to represent a set of facts within a knowledge domain. **

Semantic network A semantic network, or frame network is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices ...

– study of semantic relations between concepts. *** Semantic Web – *

Machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

– subfield of computer science that examines pattern recognition and computational learning theory in artificial intelligence. There are three broad approaches to machine learning.

Supervised learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...

occurs when the machine is given example inputs and outputs by a teacher so that it can learn a rule that maps inputs to outputs.

Unsupervised learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...

occurs when the machine determines the inputs structure without being provided example inputs or outputs.

Reinforcement learning Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine ...

occurs when a machine must perform a goal without teacher feedback. **

Pattern recognition – branch of machine learning that examines how machines recognize regularities in data. As with machine learning, teachers can train machines to recognize patterns by providing them with example inputs and outputs (i.e. supervised learning), or the machines can recognize patterns without being trained on any example inputs or outputs (i.e. unsupervised learning). **

Statistical classification –
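As a rough illustration of the supervised setting described above, the sketch below learns a mapping from inputs (short texts) to outputs (labels) from a handful of labelled examples. It is a toy word-count classifier, not a method taken from this outline; the names `train`, `classify`, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count how often each word appears under each label (the 'teacher' data)."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    # sorted() makes tie-breaking deterministic (alphabetical by label).
    return max(sorted(counts), key=lambda label: sum(counts[label][w] for w in words))

# Hypothetical labelled examples: (input, output) pairs.
examples = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(examples)
print(classify(model, "free prize inside"))
print(classify(model, "monday team meeting"))
```

A practical classifier would normally use probabilities and smoothing (for example naive Bayes) rather than raw overlap counts; the point here is only the input-to-output mapping learned from examples.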

Structures used in natural-language processing

* Anaphora – type of expression whose reference depends upon another referential element. E.g., in the sentence 'Sally preferred the company of herself', 'herself' is an anaphoric expression in that it is coreferential with 'Sally', the sentence's subject. *

Context-free language – *

Controlled natural language – a natural language with a restriction introduced on its grammar and vocabulary in order to eliminate ambiguity and complexity. * Corpus – body of data, optionally tagged (for example, through part-of-speech tagging), providing real-world samples for analysis and comparison. **

Text corpus – large and structured set of texts, nowadays usually electronically stored and processed. Text corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific subject (or ''domain''). **

Speech corpus – database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition engine). In linguistics, spoken corpora are used for research into phonetics, conversation analysis, dialectology and other fields. *

Grammar – ** Context-free grammar (CFG) – ** Constraint grammar (CG) – ** Definite clause grammar (DCG) – ** Functional unification grammar (FUG) – ** Generalized phrase structure grammar (GPSG) – ** Head-driven phrase structure grammar (HPSG) – ** Lexical functional grammar (LFG) – ** Probabilistic context-free grammar (PCFG) – another name for stochastic context-free grammar. ** Stochastic context-free grammar (SCFG) – ** Systemic functional grammar (SFG) – ** Tree-adjoining grammar (TAG) – *

Natural language – * ''n''-gram – sequence of ''n'' tokens, where a "token" is a character, syllable, or word. The ''n'' is replaced by a number. Therefore, a 5-gram is an ''n''-gram of 5 letters, syllables, or words. "Eat this" is a 2-gram (also known as a bigram). **

Bigram – ''n''-gram of 2 tokens. Every sequence of 2 adjacent elements in a string of tokens is a bigram. Bigrams are used in speech recognition and can help solve cryptograms, and bigram frequency is one approach to statistical language identification. **

Trigram – special case of the ''n''-gram, where ''n'' is 3. *

Ontology – formal representation of a set of concepts within a domain and the relationships between those concepts. **

Taxonomy – practice and science of classification, including the principles underlying classification, and the methods of classifying things or concepts. ***

Hyponymy and hypernymy – the linguistics of hyponyms and hypernyms. A hyponym shares a type-of relationship with its hypernym. For example, pigeon, crow, eagle and seagull are all hyponyms of bird (their hypernym), which, in turn, is a hyponym of animal. *** Taxonomy for search engines – typically called a "taxonomy of entities". It is a tree in which nodes are labelled with entities which are expected to occur in a web search query. These trees are used to match keywords from a search query with the keywords from relevant answers (or snippets). *

Textual entailment – directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively. The relation is directional because even if "t entails h", the reverse "h entails t" is much less certain. * Triphone – sequence of three phonemes. Triphones are useful in models of natural-language processing where they are used to establish the various contexts in which a phoneme can occur in a particular natural language.
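The ''n''-gram, bigram and trigram entries in this list can be illustrated with a short sketch. It assumes tokens are already available as a Python list (words or characters); the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "eat this now please".split()

bigrams = ngrams(words, 2)            # word-level 2-grams
trigrams = ngrams(words, 3)           # word-level 3-grams
char_bigrams = ngrams(list("eat"), 2) # character-level 2-grams

print(bigrams)
print(trigrams)
print(char_bigrams)
```

The same function covers letters, syllables, or words because it only looks at adjacency in the token sequence; how the text is split into tokens is a separate decision.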

Processes of NLP

Applications

Automated essay scoring (AES) – the use of specialized computer programs to assign grades to essays written in an educational setting. It is a method of educational assessment and an application of natural-language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories corresponding to the possible grades, for example the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification. *

Automatic image annotation – process by which a computer system automatically assigns textual metadata in the form of captioning or keywords to a digital image. The annotations are used in image retrieval systems to organize and locate images of interest from a database. *

Automatic summarization – process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper. ** Types *** Keyphrase extraction – *** Document summarization – **** Multi-document summarization – ** Methods and techniques *** Extraction-based summarization – *** Abstraction-based summarization – *** Maximum entropy-based summarization – *** Sentence extraction – *** Aided summarization – **** Human aided machine summarization (HAMS) – **** Machine aided human summarization (MAHS) – *

Automatic taxonomy induction – automated construction of tree structures from a corpus. This may be applied to building taxonomical classification systems for reading by end users, such as web directories or subject outlines. * Coreference resolution – in order to derive the correct interpretation of text, or even to estimate the relative importance of various mentioned subjects, pronouns and other referring expressions need to be connected to the right individuals or objects. Given a sentence or larger chunk of text, coreference resolution determines which words ("mentions") refer to which objects ("entities") included in the text. **

Anaphora resolution – concerned with matching up pronouns with the nouns or names that they refer to. The more general task of coreference resolution also covers "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to). *

Dialog system – *

Foreign-language reading aid – computer program that assists a non-native language user in reading properly in their target language. Reading properly here means pronouncing words correctly and placing stress on the right parts of each word. *

Foreign-language writing aid – computer program or any other instrument that assists a non-native language user (also referred to as a foreign-language learner) in writing decently in their target language. Assistive operations can be classified into two categories: on-the-fly prompts and post-writing checks. * Grammar checking – the act of verifying the grammatical correctness of written text, especially if this act is performed by a computer program. * Information retrieval – ** Cross-language information retrieval – *

Machine translation (MT) – aims to automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly. ** Classical approach of machine translation – rules-based machine translation. **

Computer-assisted translation – *** Interactive machine translation – *** Translation memory – database that stores so-called "segments", which can be sentences, paragraphs or sentence-like units (headings, titles or elements in a list) that have previously been translated, in order to aid human translators. ** Example-based machine translation – **

Rule-based machine translation – * Natural-language programming – interpreting and compiling instructions communicated in natural language into computer instructions (machine code). * Natural-language search – *

Optical character recognition (OCR) – given an image representing printed text, determine the corresponding text. *

Question answering – given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?"). ** Open domain question answering – *

Spam filtering – *

Sentiment analysis – extracts subjective information, usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion on social media for marketing purposes. *

Speech recognition – given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). In most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process. *

Speech synthesis (Text-to-speech) – * Text-proofing – *

Text simplification – automated editing of a document to include fewer words, or use easier words, while retaining its underlying meaning and information.
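The sentence extraction entry under automatic summarization above can be sketched as follows. This is a minimal frequency-based heuristic, assuming English-like text split on sentence-final punctuation; the function name `extract_summary` and the sample text are hypothetical, and real systems typically also remove stop words and weigh sentence position.

```python
import re
from collections import Counter

def extract_summary(text, k=1):
    """Return the k highest-scoring sentences, kept in their original order."""
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average document frequency of the words in the sentence.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]

text = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
print(extract_summary(text, k=1))
```

Because no stop-word list is applied, very common function words can inflate scores on longer inputs; that limitation is part of why this is only a sketch of the shallow, low-cost approach.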

Component processes

* Natural-language understanding – converts chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural-language understanding involves identifying the intended meaning among the multiple possible meanings that can be derived from a natural-language expression, which usually takes the form of organized notations of natural-language concepts. Introducing a language metamodel and ontology is an efficient, though empirical, solution. An explicit formalization of natural-language semantics, free of confusion with implicit assumptions such as the closed-world assumption (CWA) vs. the open-world assumption, or subjective Yes/No vs. objective True/False, is expected to serve as the basis for formalizing semantics. * Natural-language generation – task of converting information from computer databases into readable human language.

Component processes of natural-language understanding

* Automatic document classification (text categorization) – **

Automatic language identification – * Compound term processing – category of techniques that identify compound terms and match them to their definitions. Compound terms are built by combining two (or more) simple terms, for example "triple" is a single word term but "triple heart bypass" is a compound term. *

– * Corpus processing – **

Automatic acquisition of lexicon – ** Text normalization – **

– *

Deep linguistic processing – *

– includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no questions, content questions, statements, assertions, orders, suggestions, etc.). *

– **

Text mining – process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. *** Biomedical text mining – (also known as BioNLP) text mining applied to texts and literature of the biomedical and molecular biology domain. It is a rather recent research field drawing elements from natural-language processing, bioinformatics, medical informatics and computational linguistics. There is an increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature due to the increasing number of electronically available publications stored in databases such as PubMed. ***

Decision tree learning – ***

– **

Terminology extraction – * Latent semantic indexing – *

Lemmatisation – groups together all terms that share the same lemma so that they are classified as a single item. * Morphological segmentation – separates words into individual morphemes and identifies the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms. *

Named-entity recognition (NER) – given a stream of text, determines which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives. * Ontology learning – automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between those concepts from a corpus of natural-language text, and encoding them with an ontology language for easy retrieval. Also called "ontology extraction", "ontology generation", and "ontology acquisition". *

Parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...

– determines the

parse tree A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in comp ...

(grammatical analysis) of a given sentence. The

grammar In linguistics, the grammar of a natural language is its set of structure, structural constraints on speakers' or writers' composition of clause (linguistics), clauses, phrases, and words. The term can also refer to the study of such constraint ...

for

s is

ambiguous Ambiguity is the type of meaning in which a phrase, statement or resolution is not explicitly defined, making several interpretations plausible. A common aspect of ambiguity is uncertainty. It is thus an attribute of any idea or statement ...

and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). ** Shallow parsing – *

Part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definiti ...

– given a sentence, determines the

part of speech In grammar, a part of speech or part-of-speech ( abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are as ...

for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a

("the book on the table") or

verb A verb () is a word ( part of speech) that in syntax generally conveys an action (''bring'', ''read'', ''walk'', ''run'', ''learn''), an occurrence (''happen'', ''become''), or a state of being (''be'', ''exist'', ''stand''). In the usual descr ...

("to book a flight"); "set" can be a

; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a

tonal language Tone is the use of pitch in language to distinguish lexical or grammatical meaning – that is, to distinguish or to inflect words. All verbal languages use pitch to express emotional and other paralinguistic information and to convey emph ...

during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey intended meaning. * Query expansion – *

Relationship extraction A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE a ...

– given a chunk of text, identifies the relationships among named entities (e.g. who is the wife of whom). *

Semantic analysis (computational) Semantic analysis (computational) within applied linguistics and computer science, is a composite of semantic analysis and computational components. ''Semantic analysis'' refers to a formal analysis of meaning, and ''computational'' refers to appr ...

– formal analysis of meaning, and "computational" refers to approaches that in principle support effective implementation. ** Explicit semantic analysis – **

Latent semantic analysis Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...

– ** Semantic analytics – * Sentence breaking (also known as

sentence boundary disambiguation Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing too ...

and sentence detection) – given a chunk of text, finds the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations). *

Speech segmentation Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language proces ...

– given a sound clip of a person or people speaking, separates it into words. A subtask of

and typically grouped with it. *

Stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morph ...

– reduces an inflected or derived word into its

word stem In linguistics, a word stem is a part of a word responsible for its lexical meaning. The term is used with slightly different meanings depending on the morphology of the language in question. In Athabaskan linguistics, for example, a verb stem ...

, base, or

root In vascular plants, the roots are the organs of a plant that are modified to provide anchorage for the plant and take in water and nutrients into the plant body, which allows plants to grow taller and faster. They are most often below the sur ...

form. * Text chunking – * Tokenization – given a chunk of text, separates it into distinct words, symbols, sentences, or other units * Topic segmentation and recognition – given a chunk of text, separates it into segments each of which is devoted to a topic, and identifies the topic of the segment. *

Truecasing Truecasing, also called capitalization recovery, capitalization correction, or case restoration, is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This comm ...

– *

Word segmentation Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in comput ...

– separates a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese,

Japanese Japanese may refer to: * Something from or related to Japan, an island country in East Asia * Japanese language, spoken mainly in Japan * Japanese people, the ethnic group that identifies with Japan through ancestry or culture ** Japanese diaspor ...

and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the

vocabulary A vocabulary is a set of familiar words within a person's language. A vocabulary, usually developed with age, serves as a useful and fundamental tool for communication and acquiring knowledge. Acquiring an extensive vocabulary is one of the la ...

and morphology of words in the language. *

Word-sense disambiguation Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to cons ...

(WSD) – because many words have more than one

meaning Meaning most commonly refers to: * Meaning (linguistics), meaning which is communicated through the use of language * Meaning (philosophy), definition, elements, and types of meaning discussed in philosophy * Meaning (non-linguistic), a general te ...

, word-sense disambiguation is used to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as

WordNet WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into ''synsets'' with short definit ...

. **

Word-sense induction In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses of a word (i.e. meanings). Given that the output of word-sense i ...

– open problem of natural-language processing, which concerns the automatic identification of the senses of a word (i.e. meanings). Given that the output of word-sense induction is a set of senses for the target word (sense inventory), this task is strictly related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to solve the ambiguity of words in context. **

Automatic acquisition of sense-tagged corpora The knowledge acquisition bottleneck is perhaps the major impediment to solving the word sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexica ...

– * W-shingling – set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.
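The w-shingling entry above can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses the Jaccard coefficient as the similarity measure; both are illustrative choices that the outline does not prescribe, and the function names `shingles` and `jaccard` are hypothetical.

```python
def shingles(tokens, w):
    """Return the set of unique w-token shingles in a token sequence."""
    if w <= 0:
        raise ValueError("w must be positive")
    # Slide a window of w tokens over the sequence; the set drops repeats.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Whitespace tokenization is an assumption made for this example only.
doc1 = "a rose is a rose is a rose".split()
doc2 = "a rose is a flower which is a rose".split()

s1 = shingles(doc1, 4)  # 3 unique shingles, since two windows repeat
s2 = shingles(doc2, 4)  # 6 unique shingles
similarity = jaccard(s1, s2)  # 1 shared shingle out of 8 distinct ones
print(similarity)
```

A larger w makes the comparison stricter, because longer token runs must match exactly before two documents share a shingle.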

Component processes of natural-language generation

Natural-language generation – task of converting information from computer databases into readable human language.
* Automatic taxonomy induction (ATI) – automated building of tree structures from a corpus. While ATI is used to construct the core of ontologies (which makes it a component process of natural-language understanding), it also becomes a component process of natural-language generation when the ontologies being constructed are readable by end users (such as a subject outline) and are used to construct further documentation (such as using an outline as the basis for a report or treatise).
* Document structuring –

History of natural-language processing

History of natural-language processing
* History of machine translation
* History of automated essay scoring
* History of natural-language user interface
* History of natural-language understanding
* History of optical character recognition
* History of question answering
* History of speech synthesis
* Turing test – test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of an actual human. In the original illustrative example, a human judge engages in a natural-language conversation with a human and a machine designed to generate performance indistinguishable from that of a human being. All participants are separated from one another. If the judge cannot reliably tell the machine from the human, the machine is said to have passed the test. The test was introduced by Alan Turing in his 1950 paper "Computing Machinery and Intelligence," which opens with the words: "I propose to consider the question, 'Can machines think?'"
* Universal grammar – theory in linguistics, usually credited to Noam Chomsky, proposing that the ability to learn grammar is hard-wired into the brain. The theory suggests that linguistic ability manifests itself without being taught (''see'' poverty of the stimulus), and that there are properties that all natural human languages share. It is a matter of observation and experimentation to determine precisely what abilities are innate and what properties are shared by all languages.
* ALPAC – committee of seven scientists led by John R. Pierce, established in 1964 by the U.S. government to evaluate progress in computational linguistics in general and machine translation in particular. Its report, issued in 1966, gained notoriety for being very skeptical of the machine translation research done up to that point and for emphasizing the need for basic research in computational linguistics; this eventually caused the U.S. government to reduce its funding of the topic dramatically.
* Conceptual dependency theory – model of natural-language understanding used in artificial intelligence systems. Roger Schank at Stanford University introduced the model in 1969, in the early days of artificial intelligence. This model was extensively used by Schank's students at Yale University such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner.
* Augmented transition network – type of graph theoretic structure used in the operational definition of formal languages, used especially in parsing relatively complex natural languages, and having wide application in artificial intelligence. Introduced by William A. Woods in 1970.
* Distributed Language Translation (project) –

Timeline of NLP software

General natural-language processing concepts

Sukhotin's algorithm Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics d ...

– statistical classification algorithm for classifying characters in a text as vowels or consonants. It was initially created by Boris V. Sukhotin. *

T9 (predictive text) T9 is a predictive text technology for mobile phones (specifically those that contain a 3×4 numeric keypad), originally developed by Tegic Communications, now part of Nuance Communications. T9 stands for ''Text on 9 keys.'' T9 was used on ph ...

– stands for "Text on 9 keys", is a USA-patented predictive text technology for mobile phones (specifically those that contain a 3x4 numeric keypad), originally developed by Tegic Communications, now part of Nuance Communications. * Tatoeba – free collaborative online database of example sentences geared towards foreign-language learners. * Teragram Corporation – fully owned subsidiary of SAS Institute, a major producer of statistical analysis software, headquartered in Cary, North Carolina, USA. Teragram is based in Cambridge, Massachusetts and specializes in the application of computational linguistics to multilingual natural-language processing. *

TipTop Technologies TipTop Technologies is a real-time web, social search engine with a platform for semantic analysis of natural language. Tip-Top Search provides results capturing individual and group sentiment, opinions, and experiences from content of various ...

– company that developed TipTop Search, a real-time web, social search engine with a unique platform for semantic analysis of natural language. TipTop Search provides results capturing individual and group sentiment, opinions, and experiences from content of various sorts including real-time messages from Twitter or consumer product reviews on Amazon.com. *

Transderivational search Transderivational search (often abbreviated to TDS) is a psychological and cybernetics term, meaning when a search is being conducted for a fuzzy match across a broad field. In computing the equivalent function can be performed using content-addre ...

– when a search is being conducted for a fuzzy match across a broad field. In computing the equivalent function can be performed using content-addressable memory. *

Vocabulary mismatch Vocabulary mismatch is a common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently. Furnas et al. (1987) were perhaps the first to quantitatively study the vocabulary mismatch p ...

– common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently. * LRE Map – * Reification (linguistics) – * Semantic Web – ** Metadata – *

Spoken dialogue system A spoken dialog system (SDS) is a computer system able to converse with a human with voice. It has two essential components that do not exist in a written text dialog system: a speech recognizer and a text-to-speech module (written text dialog syst ...

– *

Affix grammar over a finite lattice In linguistics, the affix grammars over a finite lattice (AGFL) formalism is a notation for context-free grammars with finite set-valued features, acceptable to linguists of many different schools. The AGFL-project aims at the development of a te ...

– *

Aggregation (linguistics) In linguistics, aggregation is a subtask of natural language generation, which involves merging syntactic constituents (such as sentences and phrases) together. Sometimes aggregation can be done at a conceptual level. Examples A simple exampl ...

– *

Bag-of-words model The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding ...

– model that represents a text as a bag (multiset) of its words that disregards grammar and word sequence, but maintains multiplicity. This model is a commonly used to train document classifiers *

Brill tagger The Brill tagger is an inductive method for part-of-speech tagging. It was described and invented by Eric Brill in his 1993 PhD thesis. It can be summarized as an "error-driven transformation-based tagger". It is: * a form of supervised learning, w ...

– *

Cache language model A cache language model is a type of statistical language model. These occur in the natural language processing subfield of computer science and assign probabilities to given sequences of words by means of a probability distribution. Statistical lan ...

– * ChaSen,

MeCab MeCab is an open-source text segmentation library for use with text written in the Japanese language originally developed by the Nara Institute of Science and Technology and currently maintained by Taku Kudou (工藤拓) as part of his work on th ...

– provide morphological analysis and word splitting for

Classic monolingual WSD Classic monolingual Word Sense Disambiguation evaluation tasks uses WordNet as its sense inventory and is largely based on supervised / semi-supervised classification with the manually sense annotated corpora: *Classic English WSD uses the Pri ...

– * ClearForest – * CMU Pronouncing Dictionary – also known as ''cmudict'', is a public domain pronouncing dictionary designed for uses in speech technology, and was created by

Carnegie Mellon University Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. One of its predecessors was established in 1900 by Andrew Carnegie as the Carnegie Technical Schools; it became the Carnegie Institute of Technology ...

(CMU). It defines a mapping from English words to their North American pronunciations, and is commonly used in speech processing applications such as the

Festival Speech Synthesis System The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black, Paul Taylor and Richard Caley at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Subs ...

and the

CMU Sphinx CMU Sphinx, also called Sphinx for short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers (Sphinx 2 - 4) and an acoustic model traine ...

speech recognition system. * Concept mining – * Content determination – * DATR – *

DBpedia Spotlight DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web. DBpedia allows users to semantical ...

– *

– * Discourse relation – * Document-term matrix – *

Dragomir R. Radev Dragomir R. Radev (August 7, 1968 – March 29, 2023) was an American computer scientist who was a professor at Yale University, working on natural language processing and information retrieval. He also served as a University of Michigan computer ...

– * ETBLAST – *

Filtered-popping recursive transition network A filtered-popping recursive transition network (FPRTN),Javier M. Sastre"Efficient parsing using filtered-popping recursive transition networks" ''Lecture Notes in Artificial Intelligence'', 5642:241-244, 2009 or simply filtered-popping network (FPN ...

– * Robby Garner – * GeneRIF – *

Gorn address A Gorn address (Gorn, 1967) is a method of identifying and addressing any node within a tree data structure. This notation is often used for identifying nodes in a parse tree defined by phrase structure rules. The Gorn address is a sequence of zero ...

– *

Grammar induction Grammar induction (or grammatical inference) is the process in machine learning of learning a formal grammar (usually as a collection of ''re-write rules'' or '' productions'' or alternatively as a finite state machine or automaton of some kind) fr ...

– * Grammatik – * Hashing-Trick – *

Hidden Markov model A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...

– * Human language technology – *

– * International Conference on Language Resources and Evaluation – *

Kleene star In mathematical logic and computer science, the Kleene star (or Kleene operator or Kleene closure) is a unary operation, either on sets of strings or on sets of symbols or characters. In mathematics, it is more commonly known as the free monoid ...

– * Language Computer Corporation – *

Language model A language model is a probability distribution over sequences of words. Given any sequence of words of length , a language model assigns a probability P(w_1,\ldots,w_m) to the whole sequence. Language models generate probabilities by training on ...

– *

LanguageWare LanguageWare is a natural language processing (NLP) technology developed by IBM, which allows applications to process natural language text. It comprises a set of Java libraries which provide a range of NLP functions: language identification, tex ...

– *

Latent semantic mapping Latent semantic mapping (LSM) is a data-driven framework to model globally meaningful relationships implicit in large volumes of (often textual) data. It is a generalization of latent semantic analysis. In information retrieval, LSA enables retriev ...

– *

Legal information retrieval Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works. Accurate legal information retrieval is important to provide access to the law to laymen and legal profe ...

– * Lesk algorithm – *

Lessac Technologies Lessac Technologies, Inc. (LTI) is an American firm which develops voice synthesis software, licenses technology and sells synthesized novels as MP3 files. The firm currently has seven patents granted and three more pending for its automated method ...

– *

Lexalytics Lexalytics, Inc. provides sentiment and intent analysis to an array of companies using SaaS and cloud based technology. Salience 6, the engine behind Lexalytics, was built as an on-premises, multi-lingual text analysis engine. It is leased to ot ...

– * Lexical choice – *

Lexical Markup Framework Language resource management - Lexical markup framework (LMF; ISO 24613:2008), is the International Organization for Standardization ISO/TC37 standard for natural language processing (NLP) and machine-readable dictionary (MRD) lexicons. The scop ...

– * Lexical substitution – * LKB – *

Logic form Logic forms are simple, first-order logic knowledge representations of natural language sentences formed by the conjunction of concept predicates related through shared arguments. Each noun, verb, adjective, adverb, pronoun, preposition and conjun ...

– * LRE Map – *

Machine translation software usability The sections below give objective criteria for evaluating the usability of machine translation software output. Stationarity or canonical form Do repeated translations converge on a single expression in both languages? I.e. does the translation ...

– *

MAREC Marec may refer to: * MAREC, a patent information query tool *Michigan Alternative and Renewable Energy Center The Michigan Alternative and Renewable Energy Center (MAREC) was a facility located in Muskegon, Michigan that promoted research, educat ...

– * Maximum entropy – *

Message Understanding Conference The Message Understanding Conferences (MUC) for computing and computer science, were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction. T ...

– *

METEOR A meteoroid () is a small rocky or metallic body in outer space. Meteoroids are defined as objects significantly smaller than asteroids, ranging in size from grains to objects up to a meter wide. Objects smaller than this are classified as mic ...

– * Minimal recursion semantics – *

Morphological pattern A morphological pattern is a set of associations and/or operations that build the various forms of a lexeme, possibly by inflection, agglutination, compounding or derivation. Context The term is used in the domain of lexicons and morphology. Note ...

– * Multi-document summarization – * Multilingual notation – *

Naive semantics Naive semantics is an approach used in computer science for representing basic knowledge about a specific domain, and has been used in applications such as the representation of the meaning of natural language sentences in artificial intelligence a ...

– *

– * Natural-language interface – * Natural-language user interface – *

News analytics In trading strategy, news analysis refers to the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stor ...

– *

Nondeterministic polynomial In computational complexity theory, NP (nondeterministic polynomial time) is a complexity class used to classify decision problems. NP is the set of decision problems for which the problem instances, where the answer is "yes", have proof ...

– * Open domain question answering – *

Optimality theory In linguistics, Optimality Theory (frequently abbreviated OT) is a linguistic model proposing that the observed forms of language arise from the optimal satisfaction of conflicting constraints. OT differs from other approaches to phonological ...

– *

Paco Nathan Paco Nathan (born 1962) is an American computer scientist and early engineer of the World Wide Web. Nathan is also an author and performance art show producer who established much of his career in Austin, Texas. Early life Paco Nathan was brought ...

– *

Phrase structure grammar The term phrase structure grammar was originally introduced by Noam Chomsky as the term for grammar studied previously by Emil Post and Axel Thue ( Post canonical systems). Some authors, however, reserve the term for more restricted grammars in ...

– * Powerset (company) – *

Production (computer science) A production or production rule in computer science is a '' rewrite rule'' specifying a symbol substitution that can be recursively performed to generate new symbol sequences. A finite set of productions P is the main component in the specificati ...

– * PropBank – *

– * Realization (linguistics) – *

Recursive transition network A recursive transition network ("RTN") is a graph theoretical schematic used to represent the rules of a context-free grammar. RTNs have application to programming languages, natural language and lexical analysis. Any sentence that is construc ...

– *

Referring expression generation Referring expression generation (REG) is the subtask of natural language generation (NLG) that received most scholarly attention. While NLG is concerned with the conversion of non-linguistic information into natural language, REG focuses only on the ...

– *

Rewrite rule In mathematics, computer science, and logic, rewriting covers a wide range of methods of replacing subterms of a formula with other terms. Such methods may be achieved by rewriting systems (also known as rewrite systems, rewrite engines, or red ...

– * Semantic compression – *

Semantic neural network Semantic neural network (SNN) is based on John von Neumann's neural network (von Neumann, 1966) and Nikolai Amosov's M-Network. There are limitations to a link topology for von Neumann's network but SNN accep ...

– *

SemEval SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. ...

– * SPL notation – *

Stemming – reduces an inflected or derived word into its stem, base, or root form. *

String kernel In machine learning and data mining, a string kernel is a kernel function that operates on strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intuitively understood as functions measuring the simila ...

–
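The stemming entry in the list above describes reducing an inflected or derived word to a base form. A minimal sketch of that idea follows, assuming a small hand-picked suffix list and a minimum stem length; both are illustrative choices, not the Porter algorithm or the rule set of any particular library.

```python
# Toy suffix-stripping stemmer.
# SUFFIXES and MIN_STEM_LENGTH are assumptions made for this illustration.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")
MIN_STEM_LENGTH = 3


def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    lowered = word.lower()
    # Try longer suffixes first so "ingly" is preferred over "ly".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= MIN_STEM_LENGTH:
            return lowered[: -len(suffix)]
    return lowered


if __name__ == "__main__":
    for w in ["jumping", "jumped", "jumps", "quickly", "cat"]:
        print(w, "->", stem(w))
```

A rule list this crude over-strips some words and under-strips others, which is why practical stemmers add ordered rule phases and conditions on the remaining stem.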

Natural-language processing tools

Google Ngram Viewer The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in printed sources published between 1500 and 2019 in Google's text co ...

– graphs ''n''-gram usage from a corpus of more than 5.2 million books
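To make the notion of ''n''-gram usage concrete, the following sketch counts ''n''-grams in a short string. It assumes plain whitespace tokenization and lower-casing, and the helper names `ngrams` and `count_ngrams` are hypothetical; it is not how the Google Ngram Viewer is implemented.

```python
from collections import Counter
from typing import Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])


def count_ngrams(text: str, n: int) -> Counter:
    """Count n-grams in lower-cased, whitespace-tokenized text (a simplifying assumption)."""
    return Counter(ngrams(text.lower().split(), n))


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept"
    # Print the two most frequent bigrams with their counts.
    print(count_ngrams(sample, 2).most_common(2))
```

Charting usage over time would additionally require grouping such counts by publication year and normalizing by the total number of ''n''-grams per year.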

Corpora

Text corpora (see list) – large and structured sets of texts (nowadays usually electronically stored and processed). They are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. ** Bank of English **

British National Corpus The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention ...

Corpus of Contemporary American English The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU). Content The Corpus of C ...

(COCA) **

Oxford English Corpus The Oxford English Corpus (OEC) is a text corpus of 21st-century English, used by the makers of the ''Oxford English Dictionary'' and by Oxford University Press' language research programme. It is the largest corpus of its kind, containing nearl ...

Natural-language processing toolkits

The following natural-language processing toolkits are notable collections of natural-language processing software. They are suites of libraries, frameworks, and applications for symbolic and statistical natural-language and speech processing.

Named-entity recognizers

* ABNER (A Biomedical Named-Entity Recognizer) – open source text mining program that uses linear-chain conditional random field sequence models. It automatically tags genes, proteins and other entity names in text. Written by Burr Settles of the University of Wisconsin-Madison. * Stanford NER (Named-Entity Recognizer) — Java implementation of a Named-Entity Recognizer that uses linear-chain conditional random field sequence models. It automatically tags persons, organizations, and locations in text in English, German, Chinese, and Spanish languages. Written by Jenny Finkel and other members of the Stanford NLP Group at Stanford University.
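Both recognizers above are described as using linear-chain sequence models. The sketch below illustrates only the decoding step of such a model: Viterbi search for the highest-scoring label sequence when the score is a sum of per-token (emission) and label-to-label (transition) terms. The labels, tokens, and score tables are invented for illustration; a real conditional random field learns its weights from annotated data and uses much richer features, and this is not the code of ABNER or Stanford NER.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]

# Toy score tables (assumptions for illustration only).
LABELS = ["O", "PER"]
EMISSION: Scores = {
    "PER": {"jenny": 2.0, "finkel": 1.5},
    "O": {"wrote": 2.0, "it": 2.0},
}
TRANSITION: Scores = {
    "O": {"O": 0.5, "PER": 0.0},
    "PER": {"O": 0.0, "PER": 1.0},
}


def viterbi(tokens: List[str], labels: List[str], emission: Scores,
            transition: Scores, default: float = -1.0) -> List[str]:
    """Return the label sequence with the highest total additive score."""
    if not tokens:
        return []
    # best[i][y]: best score of any labeling of tokens[: i + 1] that ends in label y.
    best = [{y: emission[y].get(tokens[0], default) for y in labels}]
    back: List[Dict[str, str]] = []
    for tok in tokens[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Pick the previous label that maximizes score-so-far plus transition.
            prev = max(labels, key=lambda p: best[-1][p] + transition[p][y])
            scores[y] = best[-1][prev] + transition[prev][y] + emission[y].get(tok, default)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    sentence = ["jenny", "finkel", "wrote", "it"]
    print(list(zip(sentence, viterbi(sentence, LABELS, EMISSION, TRANSITION))))
```

The dynamic program runs in time proportional to the number of tokens times the square of the number of labels, which is what makes exact decoding practical for chain-structured models.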

Translation software

Comparison of machine translation applications Machine translation is an algorithm which attempts to translate text or speech from one natural language to another. General information Basic general information for popular machine translation applications. Languages features comparison ...

* Machine translation applications **

Google Translate Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another. It offers a website interface, a mobile app for Android and iOS, and an A ...

** DeepL ** Linguee – web service that provides an online dictionary for a number of language pairs. Unlike similar services, such as LEO, Linguee incorporates a search engine that provides access to large amounts of bilingual, translated sentence pairs, which come from the World Wide Web. As a translation aid, Linguee therefore differs from machine translation services like Babelfish and is more similar in function to a translation memory. ** UNL Universal Networking Language **

Yahoo! Babel Fish Yahoo! Babel Fish was a free Web-based multilingual translation application. In May 2012 it was replaced by Bing Translator (now Microsoft Translator), to which queries were redirected. Although Yahoo! has transitioned its Babel Fish translation s ...

** Reverso

Other software

CTAKES Apache cTAKES: clinical Text Analysis and Knowledge Extraction System is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health record unstructured data, unstructured text. It processes cl ...

– open-source natural-language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities — drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context (family history of, current, unrelated to patient), and negated/not negated. Also known as Apache cTAKES. * DMAP – * ETAP-3 – proprietary linguistic processing system focusing on English and Russian. It is a

rule-based system In computer science, a rule-based system is used to store and manipulate knowledge to interpret information in a useful way. It is often used in artificial intelligence applications and research. Normally, the term ''rule-based system'' is appli ...

which uses the Meaning-Text Theory as its theoretical foundation. * JAPE – the Java Annotation Patterns Engine, a component of the open-source General Architecture for Text Engineering (GATE) platform. JAPE is a finite state transducer that operates over annotations based on regular expressions. *

LOLITA

– "Large-scale, Object-based, Linguistic Interactor, Translator and Analyzer". LOLITA was developed by Roberto Garigliano and colleagues between 1986 and 2000. It was designed as a general-purpose tool for processing unrestricted text that could be the basis of a wide variety of applications. At its core was a semantic network containing some 90,000 interlinked concepts. *

Maluuba Maluuba is a Canadian technology company conducting research in artificial intelligence and language understanding. Founded in 2011, the company was acquired by Microsoft in 2017. In late March 2016, the company demonstrated a machine reading s ...

– intelligent personal assistant for Android devices that uses a contextual approach to search, taking into account the user's geographic location, contacts, and language. *

METAL MT A machine translation system developed at the University of Texas and at Siemens which ran on Lisp Machines. Background Originally titled the Linguistics Research System (LRS), it was later renamed METAL (Mechanical Translation and Analysis o ...

– machine translation system developed in the 1980s at the University of Texas and at Siemens which ran on Lisp Machines. *

Never-Ending Language Learning Never-Ending Language Learning system (NELL) is a semantic machine learning system developed by a research team at Carnegie Mellon University, and supported by grants from DARPA, Google, NSF, and CNPq with portions of the system running on a sup ...

– semantic machine learning system developed by a research team at Carnegie Mellon University, and supported by grants from DARPA, Google, and the NSF, with portions of the system running on a supercomputing cluster provided by Yahoo!. NELL was programmed by its developers to be able to identify a basic set of fundamental semantic relationships between a few hundred predefined categories of data, such as cities, companies, emotions and sports teams. Since the beginning of 2010, the Carnegie Mellon research team has been running NELL around the clock, sifting through hundreds of millions of web pages looking for connections between the information it already knows and what it finds through its search process – to make new connections in a manner that is intended to mimic the way humans learn new information. * NLTK – * Online-translator.com – * Regulus Grammar Compiler – software system for compiling unification grammars into grammars for speech recognition systems. *

S Voice S Voice is a discontinued intelligent personal assistant and knowledge navigator which is only available as a built-in application for the Samsung Galaxy S III, S III Mini (including NFC), S4, S4 Mini, S4 Active, S5, S5 Mini, S II Plus, ...

– * Siri (software) – *

Speaktoit Dialogflow is a natural language understanding platform used to design and integrate a conversational user interface into mobile apps, web applications, devices, bots, interactive voice response systems and related uses. History In May 2012, Sp ...

– * TeLQAS – * Weka's classification tools – * word2vec – models developed by a team of researchers led by Tomas Mikolov at Google to generate word embeddings that can reconstruct some of the linguistic context of words, using shallow, two-layer neural networks trained on a large text corpus. *

– *

speech recognition system – *

Language Grid {{Short description, Linguistics website The Language Grid is a multilingual service platform on the Internet mainly for supporting Intercultural collaboration. It enables easy registration and sharing of language resources such as online dictionari ...

– Open source platform for language web services, which can customize language services by combining existing language services.
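The word2vec entry in the list above refers to embeddings learned from the linguistic context of words. One of word2vec's training objectives, skip-gram, learns from (center word, context word) pairs drawn from a sliding window. The sketch below shows only how such pairs can be generated; the function name `skipgram_pairs` and the default window size are assumptions, and the neural network that would consume the pairs is omitted.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair each token with every neighbor at most `window` positions away."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    for center, context in skipgram_pairs("the quick brown fox".split(), window=1):
        print(center, "->", context)
```

A wider window yields more pairs per word and tends to capture more topical than syntactic similarity, so the window size is usually treated as a tunable parameter.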

Chatterbots

Chatterbot A chatbot or chatterbot is a software application used to conduct an on-line chat conversation via text or text-to-speech, in lieu of providing direct contact with a live human agent. Designed to convincingly simulate the way a human would beh ...

– a text-based conversation agent that can interact with human users through some medium, such as an instant message service. Some chatterbots are designed for specific purposes, while others converse with human users on a wide range of topics.
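As a rough illustration of how a simple text-based conversation agent can work, the sketch below matches the user's message against a short list of regular-expression rules and fills a response template, in the general spirit of early pattern-matching chatterbots. The rules, templates, and fallback reply are hypothetical and are not taken from any of the programs listed in this outline.

```python
import re
from typing import List, Tuple

# Hypothetical rules for illustration; each pairs a pattern with a reply template.
RULES: List[Tuple[re.Pattern, str]] = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE), "Hello. What would you like to talk about?"),
]
FALLBACK = "Please tell me more."


def reply(message: str) -> str:
    """Return the response for the first matching rule, or a generic fallback."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Templates without a {0} placeholder simply ignore the captured text.
            return template.format(*match.groups())
    return FALLBACK


if __name__ == "__main__":
    for line in ["Hello", "I need a holiday.", "I am tired", "The weather is nice"]:
        print(">", line)
        print(reply(line))
```

Because rules are tried in order, more specific patterns should be listed before more general ones.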

Classic chatterbots

Dr. Sbaitso Dr. Sbaitso is an artificial intelligence speech synthesis program released late in 1991 by Creative Labs in Singapore for MS-DOS-based personal computers. The name is an acronym for "SoundBlaster Acting Intelligent Text-to-Speech Operat ...

ELIZA ELIZA is an early natural language processing computer program created from 1964 to 1966 at the MIT Artificial Intelligence Laboratory by Joseph Weizenbaum. Created to demonstrate the superficiality of communication between humans and machines ...

PARRY PARRY was an early example of a chatbot, implemented in 1972 by psychiatrist Kenneth Colby. History PARRY was written in 1972 by psychiatrist Kenneth Colby, then at Stanford University. While ELIZA was a tongue-in-cheek simulation of a Rog ...

* Racter (or Claude Chatterbot) * Mark V Shaney

General chatterbots

* Albert One – 1998 and 1999 Loebner winner, by Robby Garner. * A.L.I.C.E. – 2000, 2001, and 2004

Loebner Prize The Loebner Prize was an annual competition in artificial intelligence that awards prizes to the computer programs considered by the judges to be the most human-like. The prize is reported as defunct since 2020. The format of the competition was tha ...

winner developed by Richard Wallace. *

Charlix *

Cleverbot Cleverbot is a chatterbot web application that uses machine learning techniques to have conversations with humans. It was created by British AI scientist Rollo Carpenter. It was preceded by Jabberwacky, a chatbot project that began in 1988 a ...

(winner of the 2010 Mechanical Intelligence Competition) * Elbot – 2008 Loebner Prize winner, by Fred Roberts. * Eugene Goostman – 2012 Turing 100 winner, by Vladimir Veselov. *

Fred – an early chatterbot by Robby Garner. *

Jabberwacky Jabberwacky is a chatterbot created by British programmer Rollo Carpenter. Its stated aim is to "simulate natural human chat in an interesting, entertaining and humorous manner". It is an early attempt at creating an artificial intelligence th ...

* Jeeney AI * MegaHAL *

Mitsuku Kuki is an embodied AI bot designed to befriend humans in the metaverse. Formerly known as Mitsuku, Kuki is a chatbot created from Pandorabots AIML technology by Steve Worswick. It is a five-time winner of a Turing Test competition called the ...

, 2013 and 2016 Loebner Prize winner * Rose - ... 2015 - 3x Loebner Prize winner, by Bruce Wilcox. * SimSimi – A popular artificial intelligence conversation program that was created in 2002 by ISMaker. * Spookitalk – A chatterbot used for NPCs in

Douglas Adams Douglas Noel Adams (11 March 1952 – 11 May 2001) was an English author and screenwriter, best known for ''The Hitchhiker's Guide to the Galaxy''. Originally a 1978 BBC radio comedy, ''The Hitchhiker's Guide to the Galaxy'' developed into a " ...

' ''Starship Titanic'' video game. * Ultra Hal – 2007 Loebner Prize winner, by Robert Medeksza. * Verbot

Instant messenger chatterbots

GooglyMinotaur GooglyMinotaur was an instant messaging bot on the AOL Instant Messenger network. Developed by ActiveBuddy under contract by Capitol Records, GooglyMinotaur provided Radiohead Radiohead are an English rock band formed in Abingdon, Oxford ...

, specializing in Radiohead, the first bot released by

ActiveBuddy Colloquis, previously known as ActiveBuddy and Conversagent, was a company that created conversation-based interactive agents originally distributed via instant messaging platforms. The company had offices in New York, NY and Sunnyvale, CA. Fou ...

(June 2001-March 2002) *

SmarterChild SmarterChild was a chatbot available on AOL Instant Messenger and Windows Live Messenger (previously MSN Messenger) networks. History SmarterChild was an intelligent agent or "bot" developed by ActiveBuddy, Inc., with offices in New York and ...

, developed by ActiveBuddy and released in June 2001 *

Infobot Infobot is a Perl IRC bot, first written in 1995 by Kevin Lenzo. The bot's main goal was to remember URLs and associate them with a descriptive name, so whenever someone needed a specific URL they could ask the bot. For that reason, the first Inf ...

, an assistant on IRC channels such as ''#perl'', primarily to help out with answering Frequently Asked Questions (June 1995-today) * Negobot, a bot designed to catch online pedophiles by posing as a young girl and attempting to elicit personal details from people it speaks to.

Natural-language processing organizations

AFNLP AFNLP (Asian Federation of Natural Language Processing Associations) is the organization for coordinating the natural language processing related activities and events in the Asia-Pacific region. Foundation AFNLP was founded on 4 October 2000. M ...

(Asian Federation of Natural Language Processing Associations) – the organization for coordinating the natural-language processing related activities and events in the Asia-Pacific region. *

Australasian Language Technology Association The Australasian Language Technology Association (ALTA) promotes language technology research and development in Australia and New Zealand. ALTA organises regular events for the exchange of research results and for academic and industrial training ...

– * Association for Computational Linguistics – international scientific and professional society for people working on problems involving natural-language processing.

Natural-language processing-related conferences

* Annual Meeting of the Association for Computational Linguistics (ACL) *

International Conference on Intelligent Text Processing and Computational Linguistics CICLing (International Conference on Computational Linguistics and Intelligent Text Processing; before 2017 known under the name International Conference on Intelligent Text Processing and Computational Linguistics) is an annual conference on com ...

(CICLing) * International Conference on Language Resources and Evaluation – biennial conference organised by the European Language Resources Association with the support of institutions and organisations involved in natural-language processing *

Annual Conference of the North American Chapter of the Association for Computational Linguistics

(NAACL) *

Text, Speech and Dialogue Text, Speech and Dialogue (TSD) is an annual conference involving topics on natural language processing and computational linguistics. The meeting is held every September alternating in Brno and Plzeň, Czech Republic. The first Text, Speech and ...

(TSD) – annual conference *

Text Retrieval Conference The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or ''tracks.'' It is co-sponsored by the National Institute of Standards and Technology (NIST) an ...

(TREC) – on-going series of workshops focusing on various information retrieval (IR) research areas, or tracks

Companies involved in natural-language processing

* AlchemyAPI – service provider of a natural-language processing API. * Google, Inc. – the Google search engine is an example of automatic summarization, utilizing keyphrase extraction. * Calais (Reuters product) – provider of natural-language processing services. * Wolfram Research, Inc. – developer of the natural-language processing computation engine

Wolfram Alpha WolframAlpha ( ) is an answer engine developed by Wolfram Research. It answers factual queries by computing answers from externally sourced data. WolframAlpha was released on May 18, 2009 and is based on Wolfram's earlier product Wolfram Math ...

Natural-language processing publications

Books

* ''Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing'' – Wermter, S., Riloff E. and Scheler, G. (editors). First book that addressed statistical and neural network learning of language. * ''Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics'' – by Daniel Jurafsky and James H. Martin. Introductory book on language technology.

Book series

* '' Studies in Natural Language Processing'' – book series of the Association for Computational Linguistics, published by Cambridge University Press.

Journals

* ''

Computational Linguistics Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...

'' – peer-reviewed academic journal in the field of computational linguistics. It is published quarterly by MIT Press for the Association for Computational Linguistics (ACL)

People influential in natural-language processing

* Daniel Bobrow – *

Rollo Carpenter Rollo Carpenter (born 1965) is the British-born creator of Jabberwacky and Cleverbot, learning Artificial Intelligence (AI) software. Carpenter worked as CTO of a business software startup in Silicon Valley. Career His brother is the artist ...

– creator of Jabberwacky and Cleverbot. *

Noam Chomsky – author of the seminal work ''

Syntactic Structures ''Syntactic Structures'' is an influential work in linguistics by American linguist Noam Chomsky, originally published in 1957. It is an elaboration of his teacher Zellig Harris's model of transformational generative grammar. A short monograph ...

'', which revolutionized Linguistics with '

universal grammar Universal grammar (UG), in modern linguistics, is the theory of the genetic component of the language faculty, usually credited to Noam Chomsky. The basic postulate of UG is that there are innate constraints on what the grammar of a possible h ...

', a rule-based system of syntactic structures. *

Kenneth Colby Kenneth Mark Colby (1920 – April 20, 2001) was an American psychiatrist dedicated to the theory and application of computer science and artificial intelligence to psychiatry. Colby was a pioneer in the development of computer technology as ...

– *

David Ferrucci David Ferrucci was the principal investigator who in 2007–2011 led a team of IBM and academic researchers and engineers to the development of the Watson computer system that won a television quiz. Ferrucci graduated from Manhattan College, w ...

– principal investigator of the team that created

Watson

, IBM's AI computer that won the quiz show ''Jeopardy!'' * Lyn Frazier – * Daniel Jurafsky – Professor of Linguistics and Computer Science at Stanford University. With James H. Martin, he wrote the textbook ''Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics'' *

Roger Schank – introduced the

conceptual dependency theory Conceptual dependency theory is a model of natural language understanding used in artificial intelligence systems. Roger Schank at Stanford University introduced the model in 1969, in the early days of artificial intelligence. This model was exte ...

for natural-language understanding. *

Jean E. Fox Tree Jean E. Fox Tree is a professor in the Department of Psychology at the University of California at Santa Cruz. Fox Tree studies collateral signals that people use in spontaneous speech, such as fillers (e.g. ‘you know’), prosodic informatio ...

– *

Alan Turing Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist. Turing was highly influential in the development of theoretical c ...

– originator of the

Turing Test The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluato ...

. *

Joseph Weizenbaum Joseph Weizenbaum (8 January 1923 – 5 March 2008) was a German American computer scientist and a professor at MIT. The Weizenbaum Award is named after him. He is considered one of the fathers of modern artificial intelligence. Life and care ...

– author of the ELIZA

chatterbot A chatbot or chatterbot is a software application used to conduct an on-line chat conversation via text or text-to-speech, in lieu of providing direct contact with a live human agent. Designed to convincingly simulate the way a human would beh ...

. *

Terry Winograd Terry Allen Winograd (born February 24, 1946) is an American professor of computer science at Stanford University, and co-director of the Stanford Human–Computer Interaction Group. He is known within the philosophy of mind and artificial intel ...

– professor of computer science at Stanford University, and co-director of the Stanford Human-Computer Interaction Group. He is known within the philosophy of mind and artificial intelligence fields for his work on natural language using the SHRDLU program. *

William Aaron Woods William Aaron Woods (born June 17, 1942), generally known as Bill Woods, is a researcher in natural language processing, continuous speech understanding, knowledge representation, and knowledge-based search technology. He is currently a Software E ...

– *

Maurice Gross Maurice Gross (born 21 July 1934 in Sedan, Ardennes department; died 8 December 2001 in Paris) was a French linguist (Jean-Claude Chevalier, ''Le Monde'', 12 décembre 2001) and scholar of Romance languages. Beginning in the late 1960s he developed ...

– author of the concept of local grammar (Ibrahim, Amr Helmy. 2002. "Maurice Gross (1934-2001). À la mémoire de Maurice Gross". ''Hermès'' 34), taking finite automata as the competence model of language (Dougherty, Ray. 2001. ''Maurice Gross Memorial Letter''). * Stephen Wolfram – CEO and founder of

Wolfram Research Wolfram Research, Inc. ( ) is an American multinational company that creates computational technology. Wolfram's flagship product is the technical computing program Wolfram Mathematica, first released on June 23, 1988. Other products include ...

, creator of the programming language (natural-language understanding)

Wolfram Language The Wolfram Language ( ) is a general multi-paradigm programming language developed by Wolfram Research. It emphasizes symbolic computation, functional programming, and rule-based programming and can employ arbitrary structures and data. It is ...

, and natural-language processing computation engine Wolfram Alpha. *

Victor Yngve Victor H. Yngve (July 5, 1920 – January 15, 2012; W. John Hutchins, Victor Yngve obituary, aclweb.org, accessed August 15, 2017) was professor of linguistics at the University of Chicago and the Massachusetts Institute of Technology (1953-1965). H ...

–

References

Bibliography


External links


Natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
