In natural language processing (NLP), semantic compression is a process of compacting the lexicon used to build a textual document (or a set of documents) by reducing language heterogeneity while maintaining text semantics.
As a result, the same ideas can be represented using a smaller set of words.
In most applications, semantic compression is a lossy compression: the increased prolixity does not compensate for the lexical compression, and the original document cannot be reconstructed in a reverse process.
By generalization
Semantic compression is achieved in two steps, using frequency dictionaries and a semantic network:
# determining cumulated term frequencies to identify the target lexicon,
# replacing less frequent terms with their hypernyms (generalization) from the target lexicon.
Step 1 requires assembling word frequencies and information on semantic relationships, specifically hyponymy. Moving upwards in the word hierarchy, a cumulative concept frequency is calculated by adding the sum of the hyponyms' frequencies to the frequency of their hypernym:

f_cum(k_i) = f(k_i) + Σ_j f(k_j),

where k_i is a hypernym of k_j.

Then a desired number of words with the top cumulated frequencies is chosen to build the target lexicon.
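Step 1 can be sketched as follows. This is a minimal illustration over a toy hypernym hierarchy with made-up frequencies; the terms, counts, and function names are assumptions for the example, not data from any real frequency dictionary or semantic network.

```python
# Sketch of Step 1: cumulated concept frequencies over a toy hierarchy.
from collections import defaultdict

# hyponym -> hypernym links (illustrative)
hypernym = {
    "honey bee": "insect",
    "paper wasp": "insect",
    "insect": "animal",
}

# observed term frequencies in the document collection (illustrative)
freq = {"honey bee": 7, "paper wasp": 3, "insect": 2, "animal": 1}

def cumulated_frequencies(freq, hypernym):
    """f_cum(k_i) = f(k_i) + sum of f_cum over the direct hyponyms k_j."""
    children = defaultdict(list)
    for child, parent in hypernym.items():
        children[parent].append(child)

    cum = {}
    def f_cum(term):
        if term not in cum:
            cum[term] = freq.get(term, 0) + sum(f_cum(c) for c in children[term])
        return cum[term]

    for term in set(freq) | set(hypernym.values()):
        f_cum(term)
    return cum

cum = cumulated_frequencies(freq, hypernym)
# "insect" accumulates its own count plus those of "honey bee" and
# "paper wasp": 2 + 7 + 3 == 12; "animal" then accumulates 1 + 12 == 13.
```

The target lexicon is then the desired number of terms with the highest cumulated frequencies, e.g. `sorted(cum, key=cum.get, reverse=True)[:k]`.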
In the second step, compression mapping rules are defined for the remaining words in order to handle every occurrence of a less frequent hyponym as its hypernym in output text.
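The mapping rules of the second step can be sketched as walking each out-of-lexicon word up its hypernym chain until a lexicon term is reached. The hierarchy and the chosen target lexicon below are illustrative assumptions, continuing the toy example above.

```python
# Sketch of Step 2: replace out-of-lexicon terms with their nearest
# in-lexicon hypernym. Data and names are illustrative assumptions.
hypernym = {
    "honey bee": "insect",
    "paper wasp": "insect",
    "insect": "animal",
}

# suppose the top cumulated frequencies kept only these terms
target_lexicon = {"insect", "animal", "colony"}

def compression_rule(term):
    """Follow hypernym links until a term in the target lexicon is found."""
    while term not in target_lexicon:
        if term not in hypernym:   # no hypernym left: keep the term as-is
            return term
        term = hypernym[term]
    return term

def compress(tokens):
    return [compression_rule(t) for t in tokens]

compress(["paper wasp", "honey bee", "colony"])
# -> ["insect", "insect", "colony"]
```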
Example
The fragment of text below has been processed by semantic compression. Words in bold have been replaced by their hypernyms.
They are both nest building social insects, but paper wasps and honey bees organize their colonies
in very different ways. In a new study, researchers report that despite their differences, these insects
rely on the same network of genes to guide their social behavior. The study appears in the Proceedings of the
Royal Society B: Biological Sciences. Honey bees and paper wasps are separated by more than 100 million years of
evolution, and there are striking differences in how they divvy up the work of maintaining a colony.
The procedure outputs the following text:
They are both facility building insect, but insects and honey insects arrange their biological groups
in very different structure. In a new study, researchers report that despite their difference of opinions, these insects
act the same network of genes to steer their party demeanor. The study appears in the proceeding of the
institution bacteria Biological Sciences. Honey insects and insect are separated by more than hundred million years of
organic processes, and there are impinging differences of opinions in how they divvy up the work of affirming a biological group.
Implicit semantic compression
A natural tendency to keep natural language expressions concise can be perceived as a form of implicit semantic compression: omitting meaningless words, or redundant meaningful words (especially to avoid pleonasms).
Applications and advantages
In the vector space model, compacting the lexicon leads to a reduction of dimensionality, which results in less computational complexity and a positive influence on efficiency.
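A minimal sketch of this effect, assuming a toy vocabulary and compression mapping (continuing the illustrative example above): bag-of-words vectors built over the compressed lexicon have fewer dimensions than those built over the raw vocabulary.

```python
# Illustrative sketch: semantic compression shrinks the term space,
# so bag-of-words vectors have fewer dimensions. The vocabulary and
# mapping are toy assumptions, not a real controlled dictionary.
from collections import Counter

mapping = {"honey bee": "insect", "paper wasp": "insect", "wasp": "insect"}

doc = ["honey bee", "paper wasp", "wasp", "colony", "colony"]

original_dims = len(set(doc))           # 4 distinct terms
compressed = [mapping.get(t, t) for t in doc]
compressed_dims = len(set(compressed))  # 2 distinct terms

vector = Counter(compressed)
# Counter({'insect': 3, 'colony': 2}) -- a denser, lower-dimensional vector
```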
Semantic compression is advantageous in information retrieval tasks, improving their effectiveness (in terms of both precision and recall).
[Ceglarek, D.; Haniewicz, K.; Rutkowski, W. (2010). "Quality of semantic compression in classification". Proceedings of the 2nd International Conference on Computational Collective Intelligence: Technologies and Applications, vol. 1. Springer. pp. 162–171. ISBN 978-3-642-16692-1. https://dl.acm.org/doi/10.5555/1947662.1947683] This is due to more precise descriptors (reduced effect of language diversity – limited language redundancy, a step towards a controlled dictionary).
As in the example above, it is possible to display the output as natural text (re-applying inflection, adding stop words).
See also
*
Controlled natural language
*
Information theory
*
Lexical substitution
*
Quantities of information
*
Text simplification
References
External links
Semantic compression on Project SENECA (Semantic Networks and Categorization) website