Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.
Compare speech segmentation, the process of dividing speech into linguistically meaningful portions.
Segmentation problems
Word segmentation
Word segmentation is the problem of dividing a string of written language into its component words.
In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter), although this concept has limits because of the variability with which languages emically regard collocations and compounds. Many English compound nouns are variably written (for example, ''ice box = ice-box = icebox''; ''pig sty = pig-sty = pigsty'') with a corresponding variation in whether speakers think of them as noun phrases or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm.
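The limits of the space as a word divider can be illustrated with a minimal sketch. It assumes nothing beyond Python's built-in whitespace splitting; the function name `whitespace_tokens` is hypothetical and chosen only for this illustration.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on runs of whitespace, the simplest word-boundary heuristic."""
    return text.split()


# The three spellings of the same compound yield different token counts.
for variant in ("ice box", "ice-box", "icebox"):
    tokens = whitespace_tokens(variant)
    print(f"{variant!r}: {len(tokens)} token(s) {tokens}")
```

The open compound is counted as two words while the hyphenated and solid forms are counted as one, so a purely space-based tokenizer treats orthographic variants of one lexical item inconsistently.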
However, the equivalent to the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
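As a small illustration of a non-whitespace delimiter, the sketch below splits text on the Unicode character U+1361 (ETHIOPIC WORDSPACE). The function name `split_ethiopic` and the two-word sample string are assumptions made for this example, and the pattern also accepts ordinary whitespace on the assumption that modern text may mix both conventions.

```python
import re

ETHIOPIC_WORDSPACE = "\u1361"  # the character "፡"

# Split on the Ethiopic wordspace and on ordinary whitespace alike.
_DELIMITERS = re.compile(rf"[{ETHIOPIC_WORDSPACE}\s]+")


def split_ethiopic(text: str) -> list[str]:
    """Return the non-empty pieces between word delimiters."""
    return [piece for piece in _DELIMITERS.split(text) if piece]


sample = "ሰላም\u1361ዓለም"  # two words joined by U+1361 (illustrative sample)
print(split_ethiopic(sample))
```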
The Unicode Consortium has published a ''Standard Annex on Text Segmentation'', exploring the issues of segmentation in multiscript texts.
Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
Word splitting may also refer to the process of hyphenation.
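One simple way to split concatenated text is dictionary lookup combined with dynamic programming. The sketch below is only an illustration under stated assumptions: the function name `split_words`, the toy vocabulary, and the "fewest words wins" tie-breaking rule are all choices made for this example, not a standard method; practical systems typically rely on word frequencies or learned models instead.

```python
from typing import Optional


def split_words(text: str, vocab: set[str]) -> Optional[list[str]]:
    """Segment text into vocabulary words using as few words as possible.

    Returns None when no segmentation exists for the given vocabulary.
    """
    n = len(text)
    max_len = max(map(len, vocab), default=0)
    # best[i] is the fewest-word segmentation of text[:i], or None if none exists.
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            piece = text[start:end]
            if prefix is None or piece not in vocab:
                continue
            candidate = prefix + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


# Toy vocabulary (hypothetical); "tab", "led" and "own" create a competing reading.
toy_vocab = {"the", "table", "down", "there", "tab", "led", "own"}
print(split_words("thetabledownthere", toy_vocab))
```

With this vocabulary the input can also be read as "the tab led own there"; the fewest-words preference selects the four-word reading, which shows why some scoring criterion is needed in addition to the dictionary.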
Some scholars have suggested that modern Chinese should be written with word segmentation, that is, with spaces between words as in written English, because some texts are ambiguous in a way that only the author can resolve. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree). For more details, see Chinese word-segmented writing.
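The ambiguity in this example can be made concrete by enumerating every segmentation that a dictionary permits. The sketch below assumes a tiny hand-picked vocabulary containing only the words from the example; the function name `all_segmentations` is hypothetical, and the final punctuation mark is left out of the input.

```python
def all_segmentations(text: str, vocab: set[str]) -> list[list[str]]:
    """Return every way to split text into consecutive vocabulary words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results


# Toy vocabulary built from the example sentence (an assumption for illustration).
toy_vocab = {"美国", "美", "国会", "会", "不同意"}
for words in all_segmentations("美国会不同意", toy_vocab):
    print(" ".join(words))
```

Both readings are valid with respect to the dictionary, so a segmenter needs context or statistics to choose between them. The recursion is exponential in the worst case and is meant only for short illustrative inputs.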
Intent segmentation
Intent segmentation is the problem of dividing written text into keyphrases (groups of two or more words).
In English and all other languages, the core intent or desire is identified and becomes the cornerstone of the keyphrase. The core product or service, idea, action, or thought anchors the keyphrase.
"[All things are made of atoms] [little particles that move] [around in perpetual motion] [attracting each other] [when they are a little distance apart] [but repelling] [upon being squeezed] [into one another]"
Sentence segmentation
Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, using punctuation, particularly the full stop/period character, is a reasonable approximation. However, even in English this problem is not trivial, because the full stop character is also used for abbreviations, which may or may not also terminate a sentence. For example, ''Mr.'' is not its own sentence in "''Mr. Smith went to the shops in Jones Street."'' When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
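The abbreviation-table idea can be sketched as a small rule-based splitter. The table below is a short illustrative list, not an exhaustive resource, and the names `ABBREVIATIONS` and `split_sentences` are assumptions for this example. The sketch treats a full stop, exclamation mark or question mark followed by whitespace as a candidate boundary and rejects candidates whose final token is a listed abbreviation.

```python
import re

# Illustrative abbreviation table (lower-cased, including the trailing period).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "e.g.", "i.e."}


def split_sentences(text: str) -> list[str]:
    """Split text at sentence-final punctuation, skipping known abbreviations."""
    sentences = []
    start = 0
    # Candidate boundary: '.', '!' or '?' immediately followed by whitespace.
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk."))
```

Because the rule never breaks after a listed abbreviation, it will miss the case where an abbreviation really does end a sentence, which is exactly the ambiguity described above.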
As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.
Topic segmentation
Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple classification of a specific text, the latter case implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in document classification.
Segmenting the text into topics or discourse turns might be useful in some natural language processing tasks: it can improve information retrieval or speech recognition significantly (by indexing/recognizing documents more precisely or by giving the specific part of a document corresponding to the query as a result). It is also needed in topic detection and tracking systems and text summarizing problems.
Many different approaches have been tried, e.g. HMM, lexical chains, passage similarity using word co-occurrence, clustering, topic modeling, etc.
It is quite an ambiguous task: people evaluating text segmentation systems often disagree about where the topic boundaries lie. Hence, evaluating text segmentation is also a challenging problem.
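Of the approaches listed above, passage similarity is perhaps the easiest to sketch. The following is a deliberately simplified heuristic, not a reproduction of any published system: it compares bag-of-words vectors of the sentences before and after each gap and proposes a boundary where the cosine similarity falls below a threshold. The function names, the window size and the threshold value are all assumptions chosen for the toy input.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences: list[str]) -> Counter:
    """Count lower-cased alphabetic tokens across the given sentences."""
    return Counter(re.findall(r"[a-z]+", " ".join(sentences).lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_boundaries(sentences: list[str], window: int = 2,
                     threshold: float = 0.1) -> list[int]:
    """Return indices i where a topic boundary is proposed before sentences[i]."""
    boundaries = []
    for i in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, i - window):i])
        right = bag_of_words(sentences[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


toy_document = [
    "Cats purr when cats are happy.",
    "Happy cats purr, then sleep.",
    "Stocks fell as markets closed.",
    "Markets and stocks recovered later.",
]
print(topic_boundaries(toy_document))
```

On this toy input the only gap with no shared vocabulary is the one before the third sentence. On real text, function words shared across topics would raise the similarity everywhere, so stop-word removal or term weighting would likely be needed.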
Other segmentation problems
Processes may be required to segment text into units other than those mentioned above, including morphemes (a task usually called morphological analysis) and paragraphs.
Automatic segmentation approaches
Automatic segmentation is the problem in natural language processing of implementing a computer process to segment text.
When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, as well as consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text in specific domains and sources. As an example, processing text used in medical records is a very different problem than processing news articles or real estate advertisements.
The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:
* Manual analysis of text and writing custom software
* Annotating the sample corpus with boundary information and using machine learning
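The second approach can be sketched with a toy learner. The example below assumes that the boundary annotation takes the form of spaces inserted between words; it then counts, for every pair of adjacent characters, how often a boundary falls between them, and segments new text by majority vote. The function names and the two-line corpus are hypothetical, and a count table of this kind is far simpler than the statistical models used in practice.

```python
from collections import defaultdict


def boundary_examples(segmented: str):
    """Yield ((left_char, right_char), is_boundary) pairs from a space-annotated line."""
    words = segmented.split()
    text = "".join(words)
    cuts = set()
    position = 0
    for word in words[:-1]:
        position += len(word)
        cuts.add(position)  # a word boundary falls before text[position]
    for i in range(1, len(text)):
        yield (text[i - 1], text[i]), i in cuts


def train(corpus: list[str]) -> dict:
    """Count [non-boundary, boundary] occurrences for each adjacent character pair."""
    counts = defaultdict(lambda: [0, 0])
    for line in corpus:
        for pair, is_boundary in boundary_examples(line):
            counts[pair][int(is_boundary)] += 1
    return counts


def segment(text: str, counts: dict) -> list[str]:
    """Insert a boundary wherever it was seen more often than not in training."""
    if not text:
        return []
    words, current = [], text[0]
    for left, right in zip(text, text[1:]):
        non_boundary, boundary = counts.get((left, right), (0, 0))
        if boundary > non_boundary:
            words.append(current)
            current = right
        else:
            current += right
    words.append(current)
    return words


toy_corpus = ["the cat sat", "the cat ate"]  # hypothetical annotated sample
model = train(toy_corpus)
print(segment("thecatsat", model))
```

The sketch is evaluated here on text it was trained on, which only demonstrates the mechanics; a real evaluation would use held-out annotated text.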
Some text segmentation systems take advantage of any markup, such as HTML, and of known document formats, such as PDF, to provide additional evidence for sentence and paragraph boundaries.
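As a minimal sketch of using markup as boundary evidence, the example below uses Python's standard `html.parser` module and treats a small set of block-level HTML tags as paragraph boundaries. The tag set, the class name `ParagraphExtractor` and the helper `paragraphs_from_html` are assumptions for illustration; real documents usually need a more complete tag inventory and handling of scripts, styles and nested layout.

```python
from html.parser import HTMLParser

# Tags assumed to signal a paragraph-level break (illustrative, not exhaustive).
BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphExtractor(HTMLParser):
    """Collect text, starting a new paragraph at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def flush(self):
        # Join buffered text and normalize internal whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def paragraphs_from_html(html: str) -> list[str]:
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text not closed by a block tag
    return parser.paragraphs


print(paragraphs_from_html(
    "<h1>Title</h1><p>First <b>bold</b> paragraph.</p><p>Second one.</p>"
))
```

Inline tags such as `<b>` do not trigger a break, so text inside them stays within the surrounding paragraph.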
See also
* Hyphenation
* Natural language processing
* Speech segmentation
* Lexical analysis
* Word count
* Line breaking
* Image segmentation