Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.
Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang.
Some languages including Japanese and Chinese have unambiguous sentence-ending markers.
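As a rough illustration of why such markers simplify the task, the sketch below splits text after the ideographic full stop `。` and the full-width `！` and `？`. The function name `split_cjk_sentences` and the choice of marks are assumptions made for this example; closing quotation marks and other edge cases are ignored.

```python
import re
from typing import List

# Sentence-final marks assumed for this sketch: ideographic full stop,
# full-width exclamation mark, full-width question mark.
_CJK_SENTENCE = re.compile(r"[^。！？]+[。！？]*")


def split_cjk_sentences(text: str) -> List[str]:
    """Return the chunks of text that end in a run of sentence-final marks."""
    chunks = (m.group().strip() for m in _CJK_SENTENCE.finditer(text))
    return [c for c in chunks if c]


if __name__ == "__main__":
    for sentence in split_cjk_sentences("今日は晴れです。明日は雨ですか？はい。"):
        print(sentence)
```

No abbreviation list or capitalization check is needed here, because the marks in this sketch are not reused for decimals or abbreviations.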
Strategies
The standard 'vanilla' approach to locate the end of a sentence:
:(a) If it's a period, it ends a sentence.
:(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
:(c) If the next token is capitalized, then it ends a sentence.
This strategy gets about 95% of sentences correct. The remaining 5% often involve shortened names, e.g. "D. H. Lawrence" (with whitespace between the individual words that form the full name), idiosyncratic orthographical spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation (or non-standard usage of punctuation).
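The three rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rules are read as successive overrides (a later rule wins over an earlier one), tokens are split on whitespace, and `ABBREVIATIONS` is a tiny hypothetical stand-in for a real hand-compiled list.

```python
# Hypothetical, deliberately small abbreviation list (stored without the
# trailing period); a real list would be hand-compiled and much longer.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "inc"}


def vanilla_split(text):
    """Split whitespace-tokenized text using rules (a)-(c) as overrides."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue  # only periods are boundary candidates
        boundary = True  # rule (a): a period ends a sentence
        if token[:-1].lower() in ABBREVIATIONS:
            boundary = False  # rule (b): known abbreviation
        if i + 1 < len(tokens) and tokens[i + 1][0].isupper():
            boundary = True  # rule (c): next token is capitalized
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that had no final period
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    print(vanilla_split("The price rose, e.g. by a lot. Nobody cared."))
    print(vanilla_split("He read D. H. Lawrence."))
```

The second call shows the kind of failure described above: the initials in "D. H. Lawrence" are not in the abbreviation list and are followed by capitalized tokens, so the sketch breaks the name into three pieces.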
Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
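To make the idea of learning from pre-marked text concrete, here is a much simpler stand-in. It is neither a maximum entropy model nor SATZ: it only counts, in a list of already-segmented sentences, which period-final tokens tend to occur mid-sentence, and then refuses to split after those tokens. The function names and the `min_ratio` threshold are hypothetical.

```python
from collections import Counter


def learn_non_final_tokens(marked_sentences, min_ratio=0.5):
    """Collect period-final tokens that mostly occur inside sentences."""
    inside, final = Counter(), Counter()
    for sentence in marked_sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if not token.endswith("."):
                continue
            if i == len(tokens) - 1:
                final[token.lower()] += 1   # seen at a marked boundary
            else:
                inside[token.lower()] += 1  # seen mid-sentence
    return {t for t, n in inside.items() if n / (n + final[t]) > min_ratio}


def split_with_learned(text, non_final):
    """Split after period-final tokens unless they were learned as non-final."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in non_final:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    training = ["Dr. Smith arrived.", "I saw Dr. Jones today.", "Prices rose."]
    learned = learn_non_final_tokens(training)
    print(learned)
    print(split_with_learned("Dr. Brown left. Prices rose.", learned))
```

A trained classifier would use richer context (neighboring tokens, capitalization, token length) rather than a single lookup table, but the training signal is the same: sentence breaks marked in advance.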
Software
;Examples of use of Perl compatible regular expressions ("PCRE")
:* ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
:* $sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)
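A regular-expression splitter along similar lines can be tried with Python's standard `re` module. The pattern below is an assumption written for this sketch, not a quotation of the expressions listed here: it splits on whitespace that follows a lowercase letter or digit plus `.`, `?` or `!`, and that precedes an optional double quote and an uppercase letter.

```python
import re

# Lookbehind: a lowercase letter or digit followed by ., ? or ! (fixed width,
# as Python's re requires). Lookahead: optional double quote, then uppercase.
_BOUNDARY = re.compile(r'(?<=[a-z0-9][.?!])\s+(?="?[A-Z])')


def regex_split(text):
    """Split text at whitespace that looks like a sentence boundary."""
    return _BOUNDARY.split(text)


if __name__ == "__main__":
    sample = 'It costs 3.50 today. Is that fine? "Yes," she said.'
    for sentence in regex_split(sample):
        print(sentence)
```

The decimal point in "3.50" is left alone because no whitespace follows it, but an abbreviation such as "Dr." before a capitalized name would still be split, which is the usual weakness of purely pattern-based approaches.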
;Online use, libraries, and APIs
: sent_detector (Java)
: Lingua-EN-Sentence (Perl)
: Sentence.pm (Perl)
: SATZ, An Adaptive Sentence Segmentation System, by David D. Palmer (C)
;Toolkits that include sentence detection
:* Apache OpenNLP
:* Freeling (software)
:* Natural Language Toolkit
:* Stanford NLP
:* GExp
:* CogComp-NLP
See also
* Sentence spacing
* Word divider
* Syllabification
* Punctuation
* Text segmentation
* Speech segmentation
* Sentence extraction
* Translation memory
* Multiword expression
External links
pySBD - Python Sentence Boundary Disambiguation