DisCoCat

DisCoCat (Categorical Compositional Distributional) is a mathematical framework for natural language processing which uses category theory to unify distributional semantics with the principle of compositionality. The grammatical derivations in a categorial grammar (usually a pregroup grammar) are interpreted as linear maps acting on the tensor product of word vectors to produce the meaning of a sentence or a piece of text. String diagrams are used to visualise information flow and reason about natural language semantics.
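As a small worked illustration of this idea, consider the sentence "Alice loves Bob". The sentence, the basic types n (noun) and s (sentence), and the vector spaces N and S are assumptions chosen for this sketch rather than anything fixed by the framework. Nouns receive type n and the transitive verb receives type n^r s n^l, so the grammatical derivation and the resulting meaning can be written as:

```latex
\begin{align*}
\text{Alice} \cdot \text{loves} \cdot \text{Bob}
  &\;:\; n \cdot (n^{r}\, s\, n^{l}) \cdot n \;\le\; s
  && \text{using } n \cdot n^{r} \le 1 \text{ and } n^{l} \cdot n \le 1, \\
\text{meaning}
  &= (\epsilon_{N} \otimes 1_{S} \otimes \epsilon_{N})
     \bigl(\overrightarrow{\text{Alice}} \otimes \overrightarrow{\text{loves}}
           \otimes \overrightarrow{\text{Bob}}\bigr) \;\in\; S .
\end{align*}
```

Here the noun vectors live in N, the verb is a tensor in N ⊗ S ⊗ N, and \epsilon_{N} : N ⊗ N → ℝ is the inner product, which plays the role of the grammatical reduction.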


History

The framework was first introduced by Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark as an application of categorical quantum mechanics to natural language processing. It started with the observation that pregroup grammars and quantum processes share a common mathematical structure: they both form a rigid category (also known as a non-symmetric compact closed category). As such, they both benefit from a graphical calculus, which allows purely diagrammatic reasoning. Although the analogy with quantum mechanics was kept informal at first, it eventually led to the development of quantum natural language processing.


Definition

There are multiple definitions of DisCoCat in the literature, depending on the choice made for the compositional aspect of the model. What all existing versions have in common is a categorical definition of DisCoCat as a structure-preserving functor from a category of grammar to a category of semantics, where the latter usually encodes the distributional hypothesis.

The original paper used the categorical product of FinVect with a pregroup seen as a posetal category. This approach has a shortcoming: all parallel arrows of a posetal category are equal, which means that pregroups cannot distinguish between different grammatical derivations of the same syntactically ambiguous sentence. The problem is overcome by considering instead the free rigid category $\mathbf{G}$ generated by the pregroup grammar; more intuitively, grammar is then described with diagrams rather than with partial orders. That is, $\mathbf{G}$ has generating objects for the words and the basic types of the grammar, and generating arrows $w \to t$ for the dictionary entries which assign a pregroup type $t$ to a word $w$. The arrows $f : w_1 \dots w_n \to s$ are grammatical derivations for the sentence $w_1 \dots w_n$, which can be represented as string diagrams with cups and caps, i.e. adjunction counits and units.

With this definition of pregroup grammars as free rigid categories, DisCoCat models can be defined as strong monoidal functors $F : \mathbf{G} \to \mathbf{FinVect}$. Spelled out in detail, they assign a finite-dimensional vector space $F(x)$ to each basic type $x$ and a vector $F(w \to t) \in F(t) = F(t_1) \otimes \dots \otimes F(t_n)$ in the appropriate tensor product space to each dictionary entry $w \to t$, where $t = t_1 \dots t_n$. Objects for words are sent to the monoidal unit, i.e. $F(w) = 1$, so the image of a dictionary entry is a linear map from the unit to $F(t)$, which is the same thing as a vector in $F(t)$. The meaning of a sentence $f : w_1 \dots w_n \to s$ is then given by a vector $F(f) \in F(s)$, which can be computed as the contraction of a tensor network.

The reason behind the choice of $\mathbf{FinVect}$ as the category of semantics is that vector spaces are the usual setting for distributional semantics in computational linguistics and natural language processing. The underlying idea of the distributional hypothesis, "a word is characterized by the company it keeps", is particularly relevant when assigning meaning to words like adjectives or verbs, whose semantic connotation is strongly dependent on context.
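The tensor contraction described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of any particular library: the sentence "Alice loves Bob", the two-dimensional noun and sentence spaces, the numerical values, and the helper name contract_transitive are all hypothetical.

```python
# Toy DisCoCat-style computation for "Alice loves Bob" (all numbers are made up).
# Assumed spaces: noun space N = R^2 and sentence space S = R^2.

alice = [1.0, 0.0]  # vector in N
bob = [0.0, 1.0]    # vector in N

# A transitive verb has pregroup type n^r s n^l, so its meaning is a tensor
# in N ⊗ S ⊗ N, stored here as nested lists indexed as loves[i][s][j].
loves = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.5, 0.0]],
]


def contract_transitive(subject, verb, obj):
    """Contract the subject and object vectors with the outer indices of the verb.

    This mirrors the two cups of the grammatical derivation: the result is
    sum over i, j of subject[i] * verb[i][s][j] * obj[j], a vector in S.
    """
    sentence_dim = len(verb[0])
    return [
        sum(
            subject[i] * verb[i][s][j] * obj[j]
            for i in range(len(subject))
            for j in range(len(obj))
        )
        for s in range(sentence_dim)
    ]


meaning = contract_transitive(alice, loves, bob)
print(meaning)  # [1.0, 0.0]
```

Swapping subject and object, as in contract_transitive(bob, loves, alice), yields a different sentence vector, which shows that the grammatical structure and not just the bag of words determines the result.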


Variations

Variations of DisCoCat have been proposed with a different choice for the grammar category. The main motivation is that pregroup grammars have been proved to be weakly equivalent to context-free grammars, meaning that they generate exactly the context-free languages. One such variation uses combinatory categorial grammar as the grammar category.


List of linguistic phenomena

The DisCoCat framework has been used to study the following phenomena from linguistics.
* Entailment
* Coordination
* Hyponymy and hypernymy
* Ambiguity with density matrices
* Discourse analysis
* Anaphora and ellipsis
* Language evolution


Applications in NLP

The DisCoCat framework has been applied to solve the following tasks in natural language processing.
* Word-sense disambiguation
* Semantic similarity
* Question answering
* Machine translation
* Anaphora resolution
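For the semantic similarity task, one common way to compare two sentence meanings is the cosine of the angle between their vectors, which is possible because all sentences are mapped into the same sentence space. The sketch below assumes that the sentence vectors have already been computed; the function name cosine_similarity and the example vectors are hypothetical, and the source does not say which similarity measure a given DisCoCat experiment uses.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors of equal length.

    Returns 0.0 if either vector is the zero vector, to avoid dividing by zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sentence vectors living in the same sentence space.
sentence_a = [1.0, 0.0]
sentence_b = [0.0, 0.5]
sentence_c = [2.0, 0.0]

same_direction = cosine_similarity(sentence_a, sentence_c)
orthogonal = cosine_similarity(sentence_a, sentence_b)
```

Cosine similarity ignores vector length, so sentence_a and sentence_c count as maximally similar even though their norms differ.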


See also

* Lambek calculus
* Pregroup grammar
* Distributional semantics
* Principle of compositionality
* String diagram
* Categorical quantum mechanics
* Quantum natural language processing


External links


* DisCoPy, a Python toolkit for computing with string diagrams
* lambeq, a Python library for quantum natural language processing

