Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang. Some languages, including Japanese and Chinese, have unambiguous sentence-ending markers.

Strategies

The standard "vanilla" approach to locating the end of a sentence is:
:(a) If it's a period, it ends a sentence.
:(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
:(c) If the next token is capitalized, then it ends a sentence.
This strategy gets about 95% of sentences correct. The remaining 5% often involves shortened names such as "D. H. Lawrence" (with whitespace between the individual words that form the full name), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation or non-standard usage of punctuation.

Another approach is to automatically learn a set of rules from a set of documents in which the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
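The three vanilla rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_sentences` and the tiny `ABBREVIATIONS` set are hypothetical, tokens are taken to be whitespace-separated words with the period still attached, and the rules are read as a cascade in which a later rule overrides an earlier one.

```python
# Hypothetical, deliberately tiny abbreviation list; a real system would rely
# on a much larger hand-compiled list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "a.m.", "p.m.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text into sentences using the three vanilla rules as a cascade."""
    tokens = text.split()  # assumption: whitespace tokenization
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue  # the rules only concern periods
        ends_sentence = True  # rule (a): a period ends a sentence
        if token.lower() in ABBREVIATIONS:
            ends_sentence = False  # rule (b): known abbreviation, no break
        if i + 1 < len(tokens) and tokens[i + 1][0].isupper():
            ends_sentence = True  # rule (c): next token is capitalized
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a final period
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("He left at 5 p.m. yesterday. It was late."))
print(split_sentences("Dr. Smith arrived."))
```

Under this literal reading, rule (c) overrides rule (b), so the second call splits after "Dr."; this is one kind of error such a heuristic can make.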

Software

;Examples of use of Perl-compatible regular expressions (PCRE)
:* ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
:* $sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)
;Online use, libraries, and APIs
:* sent_detector (Java)
:* Lingua-EN-Sentence (Perl)
:* Sentence.pm (Perl)
:* SATZ, An Adaptive Sentence Segmentation System, by David D. Palmer (C)
;Toolkits that include sentence detection
:* Apache OpenNLP
:* Freeling
:* Natural Language Toolkit
:* Stanford NLP
:* GExp
:* CogComp-NLP
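A lookaround-based pattern in the spirit of the regular-expression examples in this section can also be tried with Python's standard `re` module. The sketch below is an adaptation, not a listing from any of the tools named here: the names `BOUNDARY` and `regex_split` are hypothetical, and the groups are made non-capturing so that `re.split` does not return the separators.

```python
import re

# Break on whitespace that follows a lowercase letter or digit plus one of
# . ? ! (optionally followed by a closing double quote) and that precedes an
# optional opening double quote plus a capital letter.
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))'  # left context (lookbehind)
    r'(?:\r\n|\s)'                                  # separator consumed by the split
    r'(?="?[A-Z])'                                  # right context (lookahead)
)


def regex_split(text):
    """Split text at every position matched by BOUNDARY."""
    return BOUNDARY.split(text)


print(regex_split('It works. "Does it?" Yes! It does.'))
print(regex_split("See sec. 3.2 for details. Then stop."))
```

Like the vanilla rules, a pattern of this kind has no notion of abbreviations, so a string such as "Mr. Smith" would still be split after "Mr.".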

External links

pySBD - python Sentence Boundary Disambiguation