Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages. Compare speech segmentation, the process of dividing speech into linguistically meaningful portions.

Segmentation problems

Word segmentation

Word segmentation is the problem of dividing a string of written language into its component words. In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter), although this concept has limits because of the variability with which languages emically regard collocations and compounds. Many English compound nouns are variably written (for example, ''ice box = ice-box = icebox''; ''pig sty = pig-sty = pigsty'') with a corresponding variation in whether speakers think of them as noun phrases or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm.

However, the equivalent to the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited, Thai and Lao, where phrases and sentences but not words are delimited, and Vietnamese, where syllables but not words are delimited. In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.

The Unicode Consortium has published a ''Standard Annex on Text Segmentation'', exploring the issues of segmentation in multiscript texts.

Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist. Word splitting may also refer to the process of hyphenation.

Some scholars have suggested that modern Chinese should be written with word segmentation, with spaces between words as in written English, because there are ambiguous texts where only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree). For more details, see Chinese word-segmented writing.
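The ambiguity described above can be made concrete with a small dictionary-based word splitter. The following is a minimal sketch, not a production segmenter: the lexicon is a hypothetical toy word list chosen for this example, and the function name `segmentations` is likewise an assumption. It enumerates every way of covering the input with lexicon entries, which shows why a dictionary alone cannot pick the intended reading.

```python
from functools import lru_cache

# Hypothetical toy lexicon for illustration; a real system would use a
# large dictionary and a statistical model to rank the candidates.
LEXICON = frozenset({"美", "美国", "国会", "会", "不同意"})


def segmentations(text, lexicon=LEXICON):
    """Return every way to split `text` into words found in `lexicon`."""
    max_len = max(map(len, lexicon))

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one (empty) segmentation.
        if start == len(text):
            return [[]]
        results = []
        # Try every candidate word that begins at `start`.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in lexicon:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


for words in segmentations("美国会不同意"):
    print(" ".join(words))
```

With this toy lexicon the loop prints both readings of the sentence, one per line. Choosing between them requires information beyond the dictionary, such as word frequencies or context.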

Intent segmentation

Intent segmentation is the problem of dividing written words into keyphrases (groups of two or more words). In English and all other languages, the core intent or desire is identified and becomes the cornerstone of the keyphrase. A core product or service, idea, action, or thought anchors the keyphrase. "All things are made of atoms: little particles that move around in perpetual motion, attracting each other when they are a little distance apart, but repelling upon being squeezed into one another"

Sentence segmentation

Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, using punctuation, particularly the full stop/period character, is a reasonable approximation. However, even in English this problem is not trivial, due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example, ''Mr.'' is not its own sentence in "''Mr. Smith went to the shops in Jones Street.''" When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries. As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.
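The abbreviation-table idea can be sketched as follows. This is a minimal illustration under stated assumptions: the table `ABBREVIATIONS` is a hypothetical, deliberately tiny list, and the function name `split_sentences` is invented for this example. A candidate boundary is any full stop, exclamation mark, or question mark followed by whitespace or the end of the text, and the candidate is rejected when the token ending there is a known abbreviation.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (lower-cased).
ABBREVIATIONS = frozenset({"mr.", "mrs.", "dr.", "st.", "e.g.", "i.e."})


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split `text` at sentence-final punctuation, skipping abbreviations."""
    sentences = []
    start = 0
    # Candidate boundary: '.', '!' or '?' followed by whitespace or end of text.
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        tokens = text[start:end].split()
        last_token = tokens[-1].lower() if tokens else ""
        if last_token in abbreviations:
            continue  # e.g. "Mr." does not end a sentence here
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    # Keep any trailing text that lacks final punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk."))
```

Because a listed abbreviation never ends a sentence in this sketch, it will wrongly merge two sentences when an abbreviation does fall at a sentence end, which is exactly the ambiguity noted above.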

Topic segmentation

Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple classification of a specific text, the latter case implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in document classification.

Segmenting the text into topics or discourse turns might be useful in some natural language processing tasks: it can improve information retrieval or speech recognition significantly (by indexing/recognizing documents more precisely or by giving the specific part of a document corresponding to the query as a result). It is also needed in topic detection and tracking systems and text summarizing problems.

Many different approaches have been tried: e.g. HMM, lexical chains, passage similarity using word co-occurrence, clustering, topic modeling, etc.

It is quite an ambiguous task: people evaluating text segmentation systems often differ on where the topic boundaries lie. Hence, evaluating text segmentation is also a challenging problem.
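One of the approaches listed above, passage similarity using word co-occurrence, can be illustrated with a simplified sketch. It is not a reproduction of any particular published algorithm: the function names, the window size, and the threshold are all assumptions made for this example. At each gap between sentences, the words in a window before the gap are compared with the words in a window after it, and a boundary is proposed where the cosine similarity drops below the threshold.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Count lower-cased word occurrences across a block of sentences."""
    words = []
    for sentence in sentences:
        words.extend(re.findall(r"\w+", sentence.lower()))
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def topic_boundaries(sentences, window=2, threshold=0.1):
    """Return indices i where a topic boundary is proposed before sentences[i].

    `window` and `threshold` are illustrative values, not tuned settings.
    """
    boundaries = []
    for i in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, i - window):i])
        right = bag_of_words(sentences[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


sentences = [
    "The cat chased the mouse.",
    "The mouse escaped the cat.",
    "Stocks fell sharply today.",
    "Investors sold stocks today.",
]
print(topic_boundaries(sentences))
```

On this toy input the only gap with no shared vocabulary is the one before the third sentence. The sketch does no stop-word removal or stemming, so on real text frequent function words would inflate the similarity scores.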

Automatic segmentation approaches

Automatic segmentation is the problem in natural language processing of implementing a computer process to segment text.

When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, as well as consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text in specific domains and sources. As an example, processing text used in medical records is a very different problem than processing news articles or real estate advertisements.

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:
* Manual analysis of text and writing custom software
* Annotating the sample corpus with boundary information and using machine learning

Some text segmentation systems take advantage of any markup like HTML and known document formats like PDF to provide additional evidence for sentence and paragraph boundaries.
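As an illustration of using markup as boundary evidence, the sketch below treats block-level HTML tags as paragraph boundaries, using Python's standard `html.parser` module. The set of tags in `BLOCK_TAGS` and the names `ParagraphExtractor` and `html_paragraphs` are assumptions made for this example; a real system would need to handle far more of HTML.

```python
from html.parser import HTMLParser

# Assumed set of tags that signal a paragraph-level boundary.
BLOCK_TAGS = frozenset({"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"})


class ParagraphExtractor(HTMLParser):
    """Collect text between block-level tags as separate paragraphs."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def flush(self):
        # Join buffered text and normalise runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def html_paragraphs(html):
    """Return the paragraph-level text segments found in `html`."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any text after the last block tag
    return parser.paragraphs


print(html_paragraphs("<h1>Title</h1><p>First <b>para</b>.</p><p>Second para.</p>"))
```

Inline tags such as `<b>` are ignored, so their text stays inside the enclosing paragraph. The resulting paragraphs could then be passed to a sentence splitter, with the markup boundaries serving as additional evidence.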
