Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

of deciding where

sentences The ''Sentences'' (. ) is a compendium of Christian theology written by Peter Lombard around 1150. It was the most important religious textbook of the Middle Ages. Background The sentence genre emerged from works like Prosper of Aquitaine's ...

begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of

punctuation mark Punctuation marks are marks indicating how a piece of written text should be read (silently or aloud) and, consequently, understood. The oldest known examples of punctuation marks were found in the Mesha Stele from the 9th century BC, consisti ...

s. In

written English English orthography comprises the set of rules used when writing the English language, allowing readers and writers to associate written graphemes with the sounds of spoken English, as well as other features of the language. English's orthograp ...

, a period may indicate the end of a sentence, or may denote an

abbreviation An abbreviation () is a shortened form of a word or phrase, by any method including shortening (linguistics), shortening, contraction (grammar), contraction, initialism (which includes acronym), or crasis. An abbreviation may be a shortened for ...

, a

decimal point FIle:Decimal separators.svg, alt=Four types of separating decimals: a) 1,234.56. b) 1.234,56. c) 1'234,56. d) ١٬٢٣٤٫٥٦., Both a comma and a full stop (or period) are generally accepted decimal separators for international use. The apost ...

, an

ellipsis The ellipsis (, plural ellipses; from , , ), rendered , alternatively described as suspension points/dots, points/periods of ellipsis, or ellipsis points, or colloquially, dot-dot-dot,. According to Toner it is difficult to establish when t ...

, or an email address, among other possibilities. About 47% of the periods in ''

The Wall Street Journal ''The Wall Street Journal'' (''WSJ''), also referred to simply as the ''Journal,'' is an American newspaper based in New York City. The newspaper provides extensive coverage of news, especially business and finance. It operates on a subscriptio ...

corpus Corpus (plural ''corpora'') is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of ...

denote abbreviations.

Question mark The question mark (also known as interrogation point, query, or eroteme in journalism) is a punctuation, punctuation mark that indicates a question or interrogative clause or phrase in many languages. History The history of the question mark is ...

s and exclamation marks can be similarly ambiguous due to use in

emoticon An emoticon (, , rarely , ), short for emotion icon, is a pictorial representation of a facial expression using Character (symbol), characters—usually punctuation marks, numbers and Alphabet, letters—to express a person's feelings, mood ...

source code In computing, source code, or simply code or source, is a plain text computer program written in a programming language. A programmer writes the human readable source code to control the behavior of a computer. Since a computer, at base, only ...

, and

slang A slang is a vocabulary (words, phrases, and linguistic usages) of an informal register, common in everyday conversation but avoided in formal writing and speech. It also often refers to the language exclusively used by the members of pa ...

. Some languages including Japanese and Chinese have unambiguous sentence-ending markers.

Strategies

The standard '

vanilla Vanilla is a spice derived from orchids of the genus ''Vanilla (genus), Vanilla'', primarily obtained from pods of the flat-leaved vanilla (''Vanilla planifolia, V. planifolia''). ''Vanilla'' is not Autogamy, autogamous, so pollination ...

' approach to locate the end of a sentence: :(a) If it is a period, it ends a sentence. :(b) If the preceding token is in the hand-compiled list of abbreviations, then it does not end a sentence. :(c) If the next token is capitalized, then it ends a sentence. This strategy gets about 95% of sentences correct. Things such as shortened names, e.g. " D. H. Lawrence" (with whitespaces between the individual words that form the full name), idiosyncratic orthographical spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like " .hack//SIGN") and usage of non-standard punctuation (or non-standard usage ''of'' punctuation) in a text often fall under the remaining 5%. Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.

Software

;Examples of use of Perl compatible

regular expression A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...

s ("

PCRE Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. PCRE's synta ...

") :* ((?<= -z0-9.?!]), (?<= -z0-9.?!]\"))(\s, \r\n)(?=\"? -Z :* $sentences = preg_split("/(??\!\.)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for

PHP PHP is a general-purpose scripting language geared towards web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by the PHP Group. ...

) ;Online use, libraries, and APIs :* sent_detectorJava :* Lingua-EN-Sentenceperl :* Sentence.pmperl :* SATZAn Adaptive Sentence Segmentation Systemby David D. PalmerC ;Toolkits that include sentence detection :* Apache OpenNLP :* Freeling (software) :* Natural Language Toolkit :* Stanford NLP :* GExp :* CogComp-NLP

References

External links

pySBD - python Sentence Boundary Disambiguation
{{Natural language processing Tasks of natural language processing

Strategies

Software

See also

References

External links