DisCoCat (Categorical Compositional Distributional) is a mathematical framework for natural language processing which uses category theory to unify distributional semantics with the principle of compositionality. The grammatical derivations in a categorial grammar (usually a pregroup grammar) are interpreted as linear maps acting on the tensor product of word vectors to produce the meaning of a sentence or a piece of text. String diagrams are used to visualise information flow and reason about natural language semantics.
History
The framework was first introduced by Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark as an application of categorical quantum mechanics to natural language processing. It started with the observation that pregroup grammars and quantum processes share a common mathematical structure: they both form a rigid category (also known as a non-symmetric compact closed category). As such, they both benefit from a graphical calculus, which allows purely diagrammatic reasoning. Although the analogy with quantum mechanics was kept informal at first, it eventually led to the development of quantum natural language processing.
Definition
There are multiple definitions of DisCoCat in the literature, depending on the choice made for the compositional aspect of the model. What all existing versions have in common, however, is a categorical definition of DisCoCat as a structure-preserving functor from a category of grammar to a category of semantics, which usually encodes the distributional hypothesis.
The original paper used the categorical product of FinVect (the category of finite-dimensional vector spaces and linear maps) with a pregroup seen as a posetal category. Unfortunately, this approach does not work: all parallel arrows of a posetal category are equal, which means that pregroups cannot distinguish between different grammatical derivations for the same syntactically ambiguous sentence.
Instead, one needs to consider the free rigid category generated by the pregroup grammar. That is, this category has generating objects for the words and the basic types of the grammar, and one generating arrow for each dictionary entry, going from a word to the pregroup type assigned to it. The arrows from a sequence of words to the sentence type are the grammatical derivations for that sentence. They can be represented as string diagrams with cups and caps, i.e. adjunction units and counits.
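The cups in such a derivation contract adjacent pairs of adjoint types. As a rough illustration, the following Python sketch checks whether a sequence of words reduces to the sentence type. The lexicon, the encoding of adjoints as integers, and the function names are all hypothetical choices made for this example. The greedy stack-based strategy is a simplification: it finds one derivation for simple cases and is not a complete pregroup parser.

```python
from typing import Dict, List, Tuple

# A simple type is (basic_type, adjoint_order): order 0 is the plain type x,
# -1 is the left adjoint x^l and +1 is the right adjoint x^r.
SimpleType = Tuple[str, int]

# Hypothetical toy dictionary: nouns have type n, a transitive verb n^r s n^l.
LEXICON: Dict[str, List[SimpleType]] = {
    "Alice": [("n", 0)],
    "Bob": [("n", 0)],
    "loves": [("n", 1), ("s", 0), ("n", -1)],
}


def reduce_types(types: List[SimpleType]) -> List[SimpleType]:
    """Greedily apply contractions x^(k) x^(k+1) -> 1 from left to right.

    Each contraction corresponds to one cup in the string diagram.
    """
    stack: List[SimpleType] = []
    for basic, order in types:
        if stack and stack[-1] == (basic, order - 1):
            stack.pop()  # the pair cancels: draw a cup
        else:
            stack.append((basic, order))
    return stack


def is_sentence(words: List[str]) -> bool:
    """Return True if the concatenated word types reduce to the type s."""
    types = [t for word in words for t in LEXICON[word]]
    return reduce_types(types) == [("s", 0)]


print(is_sentence(["Alice", "loves", "Bob"]))
print(is_sentence(["loves", "Alice", "Bob"]))
```

For "Alice loves Bob" the two nouns cancel against the adjoints on either side of the verb, leaving only the sentence type. In the scrambled word order the leading right adjoint has nothing to its left to cancel with, so the reduction fails.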
With this definition of pregroup grammars as free rigid categories, DisCoCat models can be defined as strong monoidal functors from this free rigid category to FinVect. Spelled out in detail, such a functor assigns a finite-dimensional vector space to each basic type and, to each dictionary entry, a vector in the tensor product space determined by the pregroup type of the word (objects for words are sent to the monoidal unit, i.e. the one-dimensional space of scalars). The meaning of a sentence is then given by a vector in the space assigned to the sentence type, which can be computed as the contraction of a tensor network.
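As a minimal sketch of this tensor contraction, the following Python example computes the meaning of a subject-verb-object sentence with NumPy. It assumes a two-dimensional noun space and a two-dimensional sentence space. The vectors and the verb tensor are hypothetical values chosen by hand for illustration and are not learned from any corpus.

```python
import numpy as np

# Hypothetical toy model: a 2-dimensional noun space N and a 2-dimensional
# sentence space S. None of these numbers come from real data.
alice = np.array([1.0, 0.0])  # vector in N
bob = np.array([0.0, 1.0])    # vector in N

# A transitive verb of type n^r s n^l is sent to a tensor in N (x) S (x) N,
# indexed here as loves[subject, sentence, object].
loves = np.zeros((2, 2, 2))
loves[0, :, 1] = [1.0, 0.0]   # "Alice loves Bob" -> first basis vector of S
loves[1, :, 0] = [0.0, 1.0]   # "Bob loves Alice" -> second basis vector of S


def transitive_sentence(subject, verb, obj):
    """Contract the two cups of the derivation n (n^r s n^l) n -> s."""
    return np.einsum("i,isj,j->s", subject, verb, obj)


alice_loves_bob = transitive_sentence(alice, loves, bob)
bob_loves_alice = transitive_sentence(bob, loves, alice)
print(alice_loves_bob, bob_loves_alice)
```

The index string passed to `np.einsum` mirrors the string diagram: the repeated indices `i` and `j` are the two cups, and the free index `s` is the open sentence wire. Because the verb tensor is not symmetric in its noun indices, swapping subject and object yields a different sentence vector.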
The reason for choosing FinVect as the category of semantics is that vector spaces are the usual setting for distributional semantics in computational linguistics and natural language processing. The underlying idea of the distributional hypothesis, "A word is characterized by the company it keeps", is particularly relevant when assigning meaning to words like adjectives or verbs, whose meaning depends strongly on context.
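One common way to turn the distributional hypothesis into vectors is to count which words occur near a target word. The sketch below does this in plain Python on a hypothetical four-sentence corpus. The corpus, the window size, and the function names are assumptions made for illustration, and practical systems use far larger corpora and additional weighting schemes.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus, chosen only to illustrate the idea.
CORPUS = [
    "the cat chases the mouse",
    "the dog chases the cat",
    "the cat eats fish",
    "the dog eats meat",
]


def context_vector(target: str, window: int = 2) -> Counter:
    """Count the words occurring within `window` positions of `target`."""
    counts: Counter = Counter()
    for sentence in CORPUS:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


cat, dog, fish = (context_vector(w) for w in ("cat", "dog", "fish"))
print(cosine(cat, dog), cosine(cat, fish))
```

In this toy corpus "cat" and "dog" appear in similar contexts (after "the", near "chases" and "eats"), so their count vectors are closer to each other than the vectors of "cat" and "fish".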
Variations
Variations of DisCoCat have been proposed with a different choice for the grammar category. The main motivation is that pregroup grammars have been proved to be weakly equivalent to context-free grammars, and so generate only context-free languages. One such variation uses combinatory categorial grammar as the grammar category.
List of linguistic phenomena
The DisCoCat framework has been used to study the following phenomena from linguistics.
* Entailment
* Coordination
* Hyponymy and hypernymy
* Ambiguity with density matrices
* Discourse analysis
* Anaphora and ellipsis
* Language evolution
Applications in NLP
The DisCoCat framework has been applied to solve the following tasks in natural language processing.
* Word-sense disambiguation
* Semantic similarity
* Question answering
* Machine translation
* Anaphora resolution
See also
* Lambek calculus
* Pregroup grammar
* Distributional semantics
* Principle of compositionality
* String diagram
* Categorical quantum mechanics
* Quantum natural language processing
External links
* DisCoPy, a Python toolkit for computing with string diagrams
* lambeq, a Python library for quantum natural language processing