Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in ''The Wall Street Journal'' corpus denote abbreviations. Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, source code, and slang.
Some languages, including Japanese and Chinese, have unambiguous sentence-ending markers.
Strategies
The standard 'vanilla' approach to locate the end of a sentence:
:(a) If it is a period, it ends a sentence.
:(b) If the preceding token is in the hand-compiled list of abbreviations, then it does not end a sentence.
:(c) If the next token is capitalized, then it ends a sentence.
This strategy gets about 95% of sentences correct. The remaining 5% often involve shortened names such as "D. H. Lawrence" (with whitespace between the initials), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation (or non-standard usage ''of'' punctuation).
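The three vanilla rules can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes the rules are applied in order, each overriding the previous one, that tokens are whitespace-separated, and that only token-final periods are candidates. The abbreviation list and the function name `vanilla_split` are hypothetical.

```python
# Hypothetical, deliberately tiny abbreviation list; a real hand-compiled
# list would be far longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc.", "vs."}


def vanilla_split(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences with rules (a)-(c), applied in order."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue  # only token-final periods are considered here
        boundary = True  # rule (a): a period ends a sentence
        if token.lower() in abbreviations:
            boundary = False  # rule (b): known abbreviation, no boundary
        if i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
            boundary = True  # rule (c): capitalized next token, boundary
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing tokens
        sentences.append(" ".join(current))
    return sentences


print(vanilla_split("The price rose, e.g. by two points. Analysts were surprised."))
print(vanilla_split("D. H. Lawrence wrote novels."))
```

The first call yields two sentences, because "e.g." is in the list and is followed by a lowercase word. The second call splits after each initial, which is the kind of error on shortened names described above.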
Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
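As a rough illustration of learning from pre-marked text, the sketch below trains a binary logistic regression (the two-class case of a maximum entropy model) to decide whether a token-final period ends a sentence. It is not SATZ or any published system: the feature set, the toy training sentences, and all function names are assumptions made for the example.

```python
import math


def features(tokens, i):
    """Binary features for the period-final token at index i (hypothetical set)."""
    token = tokens[i]
    feats = {"bias", "word=" + token.lower()}
    if len(token.rstrip(".")) <= 2:
        feats.add("short")  # very short tokens are often abbreviations
    if i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
        feats.add("next_cap")  # next token starts with a capital letter
    return feats


def make_examples(sentences):
    """Turn pre-segmented sentences into (features, label) pairs."""
    tokens, ends = [], set()
    for sentence in sentences:
        tokens.extend(sentence.split())
        ends.add(len(tokens) - 1)  # index of each sentence-final token
    return [(features(tokens, i), 1.0 if i in ends else 0.0)
            for i, tok in enumerate(tokens) if tok.endswith(".")]


def probability(weights, feats):
    """Modelled probability that the candidate period is a boundary."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=300, rate=0.1):
    """Batch gradient ascent on the log-likelihood, starting from zero weights."""
    weights = {}
    for _ in range(epochs):
        gradient = {}
        for feats, label in examples:
            error = label - probability(weights, feats)
            for f in feats:
                gradient[f] = gradient.get(f, 0.0) + error
        for f, g in gradient.items():
            weights[f] = weights.get(f, 0.0) + rate * g
    return weights


def segment(text, weights):
    """Split text at periods the model scores as boundaries."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token.endswith(".") and probability(weights, features(tokens, i)) > 0.5:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


# Toy pre-marked corpus: one sentence per list item (made up for illustration).
TRAINING = [
    "Dr. Smith arrived at noon.",
    "He met Mr. Jones there.",
    "They talked for hours.",
    "Mr. Jones left early.",
    "Dr. Smith stayed late.",
]
WEIGHTS = train(make_examples(TRAINING))
print(segment("Ms. Lee spoke briefly. Dr. Smith listened.", WEIGHTS))
```

Unlike the hand-compiled list, nothing here names "Ms." explicitly; the model has to rely on the learned weight of the `short` feature. A corpus this small only demonstrates the mechanics and says nothing about real-world accuracy.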
Software
;Examples of use of Perl-compatible regular expressions ("PCRE")
:* ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
:* $sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)
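A lookaround-based split of this kind can also be tried with Python's `re` module. The pattern below is an adaptation written for this example, not a canonical SBD regex: it splits on whitespace that follows a lowercase letter or digit plus `.`, `?`, or `!` (optionally with a closing double quote) and that precedes an optional opening quote and a capital letter. The name `SENTENCE_BREAK` is hypothetical.

```python
import re

# Each lookbehind is fixed-width, which Python's re module requires.
SENTENCE_BREAK = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))'  # end of previous sentence
    r'\s+'                                          # the whitespace to split on
    r'(?="?[A-Z])'                                  # start of next sentence
)

text = 'He paid 3.50 dollars. "Is that all?" she asked. Yes!'
print(SENTENCE_BREAK.split(text))
# → ['He paid 3.50 dollars.', '"Is that all?" she asked.', 'Yes!']

# The pattern has no abbreviation list, so it still breaks after titles.
print(SENTENCE_BREAK.split("Dr. Smith left."))
# → ['Dr.', 'Smith left.']
```

The decimal point in "3.50" is left alone because no whitespace follows it, while the second call shows why pure pattern matching is usually combined with an abbreviation list or a learned model.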
;Online use, libraries, and APIs
:* sent_detector (Java)
:* Lingua-EN-Sentence (Perl)
:* Sentence.pm (Perl)
:* SATZ: An Adaptive Sentence Segmentation System, by David D. Palmer (C)
;Toolkits that include sentence detection
:* Apache OpenNLP
:* Freeling
:* Natural Language Toolkit
:* Stanford NLP
:* GExp
:* CogComp-NLP
See also
* Multiword expression
* Punctuation
* Sentence extraction
* Sentence spacing
* Speech segmentation
* Syllabification
* Text segmentation
* Translation memory
* Word divider
External links
* pySBD - Python Sentence Boundary Disambiguation