HOME

TheInfoList



OR:

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in
natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
of deciding where
sentences ''The Four Books of Sentences'' (''Libri Quattuor Sententiarum'') is a book of theology written by Peter Lombard in the 12th century. It is a systematic compilation of theology, written around 1150; it derives its name from the ''sententiae'' o ...
begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of
punctuation mark Punctuation (or sometimes interpunction) is the use of spacing, conventional signs (called punctuation marks), and certain typographical devices as aids to the understanding and correct reading of written text, whether read silently or aloud. A ...
s. In written English, a
period Period may refer to: Common uses * Era, a length or span of time * Full stop (or period), a punctuation mark Arts, entertainment, and media * Period (music), a concept in musical composition * Periodic sentence (or rhetorical period), a concept ...
may indicate the end of a sentence, or may denote an
abbreviation An abbreviation (from Latin ''brevis'', meaning ''short'') is a shortened form of a word or phrase, by any method. It may consist of a group of letters or words taken from the full version of the word or phrase; for example, the word ''abbrevia ...
, a
decimal point A decimal separator is a symbol used to separate the integer part from the fractional part of a number written in decimal form (e.g., "." in 12.45). Different countries officially designate different symbols for use as the separator. The ch ...
, an
ellipsis The ellipsis (, also known informally as dot dot dot) is a series of dots that indicates an intentional omission of a word, sentence, or whole section from a text without altering its original meaning. The plural is ellipses. The term origin ...
, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal
corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...
denote abbreviations.
Question mark The question mark (also known as interrogation point, query, or eroteme in journalism) is a punctuation mark that indicates an interrogative clause or phrase in many languages. History In the fifth century, Syriac Bible manuscripts used q ...
s and exclamation marks can be similarly ambiguous due to use in
emoticon An emoticon (, , rarely , ), short for "emotion icon", also known simply as an emote, is a pictorial representation of a facial expression using characters—usually punctuation marks, numbers, and letters—to express a person's feelings, ...
s,
computer code A computer is a machine that can be programmed to carry out sequences of arithmetic or logical operations (computation) automatically. Modern digital electronic computers can perform generic sets of operations known as programs. These progra ...
, and
slang Slang is vocabulary (words, phrases, and linguistic usages) of an informal register, common in spoken conversation but avoided in formal writing. It also sometimes refers to the language generally exclusive to the members of particular in-gr ...
. Some languages including Japanese and Chinese have unambiguous sentence-ending markers.


Strategies

The standard '
vanilla Vanilla is a spice derived from orchids of the genus ''Vanilla (genus), Vanilla'', primarily obtained from pods of the Mexican species, flat-leaved vanilla (''Vanilla planifolia, V. planifolia''). Pollination is required to make the p ...
' approach to locate the end of a sentence: :(a) If it's a period, it ends a sentence. :(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence. :(c) If the next token is capitalized, then it ends a sentence. This strategy gets about 95% of sentences correct. Things such as shortened names, e.g. " D. H. Lawrence" (with whitespaces between the individual words that form the full name), idiosyncratic orthographical spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like " .hack//SIGN") and usage of non-standard punctuation (or non-standard usage ''of'' punctuation) in a text often fall under the remaining 5%. Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. Th
SATZ
architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.


Software

;Examples of use of Perl compatible
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
s ("
PCRE Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. PCRE's syntax ...
") :* ((?<= -z0-9.?!]), (?<= -z0-9.?!]\"))(\s, \r\n)(?=\"? -Z :* $sentences = preg_split("/(??\!\.)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for
PHP PHP is a general-purpose scripting language geared toward web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by The PHP Group. ...
) ;Online use, libraries, and APIs :
sent_detector
ava :
Lingua-EN-Sentence
erl :
Sentence.pm
erl :
SATZ
n Adaptive Sentence Segmentation Systemby David D. PalmerC ;Toolkits that include sentence detection :* Apache OpenNLP

:* Freeling (software)

:* Natural Language Toolkit

:* Stanford NLP

:* GExp

:
CogComp-NLP


See also

*
Sentence spacing Sentence spacing concerns how spaces are inserted between sentences in typeset text and is a matter of typographical convention. Since the introduction of movable-type printing in Europe, various sentence spacing conventions have been used i ...
*
Word divider In punctuation, a word divider is a glyph that separates written words. In languages which use the Latin, Cyrillic, and Arabic alphabets, as well as other scripts of Europe and West Asia, the word divider is a blank space, or ''whitespace''. T ...
*
Syllabification Syllabification () or syllabication (), also known as hyphenation, is the separation of a word into syllables, whether spoken, written or signed. Overview The written separation into syllables is usually marked by a hyphen when using English o ...
*
Punctuation Punctuation (or sometimes interpunction) is the use of spacing, conventional signs (called punctuation marks), and certain typographical devices as aids to the understanding and correct reading of written text, whether read silently or aloud. An ...
*
Text segmentation Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in comput ...
*
Speech segmentation Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language proces ...
* Sentence extraction *
Translation memory A translation memory (TM) is a database that stores "segments", which can be sentences, paragraphs or sentence-like units (headings, titles or elements in a list) that have previously been translated, in order to aid human translators. The translati ...
*
Multiword expression A multiword expression (MWE), also called phraseme, is a lexeme-like unit made up of a sequence of two or more lexemes that has properties that are not predictable from the properties of the individual lexemes or their normal mode of combination. M ...


References


External links


pySBD - python Sentence Boundary Disambiguation
{{Natural language processing Tasks of natural language processing