
Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in ''The Wall Street Journal'' corpus denote abbreviations. Question marks and exclamation marks can be similarly ambiguous due to their use in emoticons, source code, and slang. Some languages, including Japanese and Chinese, have unambiguous sentence-ending markers.
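For languages with dedicated sentence-ending markers, segmentation can in principle reduce to scanning for those markers. The following is a minimal sketch in Python, assuming the ideographic full stop and the full-width question and exclamation marks are the only terminators of interest; the function name is hypothetical, and real text (for example, text with closing quotation marks after the terminator) would need more care.

```python
import re

# Assumed terminators: ideographic full stop, full-width exclamation
# mark, and full-width question mark.
TERMINATORS = "。！？"

# A sentence is a run of non-terminator characters followed by an
# optional single terminator.
_SENTENCE = re.compile(rf"[^{TERMINATORS}]+[{TERMINATORS}]?")

def split_marked_sentences(text):
    """Return the sentences in `text`, each keeping its terminator."""
    return _SENTENCE.findall(text)
```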


Strategies

The standard 'vanilla' approach to locating the end of a sentence:
:(a) If it is a period, it ends a sentence.
:(b) If the preceding token is in the hand-compiled list of abbreviations, then it does not end a sentence.
:(c) If the next token is capitalized, then it ends a sentence.
This strategy gets about 95% of sentences correct. The remaining 5% often involve shortened names, e.g. "D. H. Lawrence" (with whitespace between the individual parts of the name), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation (or non-standard usage ''of'' punctuation).

Another approach is to automatically learn a set of rules from a set of documents in which the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
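The three vanilla rules (a) to (c) can be sketched in a few lines of Python. This is an illustrative reading, not a reference implementation: it assumes whitespace tokenization, applies the rules in order so that a later rule overrides an earlier one, and uses a tiny, hypothetical abbreviation list in place of a real hand-compiled one.

```python
# Hypothetical, deliberately small abbreviation list (lowercase, with
# the trailing period); real lists are much larger.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc.", "vs."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split `text` into sentences using the vanilla period rules."""
    tokens = text.split()  # assumption: whitespace tokenization
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        boundary = True                          # rule (a)
        if token.lower() in abbreviations:       # rule (b)
            boundary = False
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token is not None and next_token[0].isupper():
            boundary = True                      # rule (c)
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences
```

Under this reading, `split_sentences("I bought apples, pears, etc. for the trip. We left at noon.")` yields two sentences, while `split_sentences("He met Dr. Smith.")` is wrongly split after "Dr." because rule (c) fires on the capitalized name. This is the kind of case that falls into the error margin of the heuristic.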


Software

;Examples of use of Perl compatible regular expressions ("PCRE")
:* ((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
:* $sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)
;Online use, libraries, and APIs
:* sent_detector - Java
:* Lingua-EN-Sentence - Perl
:* Sentence.pm - Perl
:* SATZ - An Adaptive Sentence Segmentation System by David D. Palmer - C
;Toolkits that include sentence detection
:* Apache OpenNLP
:* Freeling (software)
:* Natural Language Toolkit
:* Stanford NLP
:* GExp
:* CogComp-NLP
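A regular-expression splitter in the spirit of the PCRE examples above can be sketched with Python's standard `re` module. The pattern below is an adaptation written for this illustration, not a transcription of any listed tool: it breaks on whitespace that follows a terminator (optionally followed by a closing double quote) and precedes an optional opening quote plus a capital letter. Non-capturing groups are used so that `split` returns only the sentences.

```python
import re

# Assumed pattern for illustration: lookbehind for "x." / "x?" / "x!"
# (optionally followed by a closing quote), then whitespace, then a
# lookahead for an optional opening quote and an uppercase letter.
SENTENCE_BREAK = re.compile(
    r'(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))'
    r'\s+'
    r'(?="?[A-Z])'
)

def split_on_pattern(text):
    """Split `text` at whitespace that the pattern treats as a boundary."""
    return SENTENCE_BREAK.split(text)
```

Like the vanilla heuristic, such a pattern has no notion of abbreviations: `split_on_pattern("Dr. Smith arrived.")` breaks after "Dr.", which is why pattern-only splitters are usually combined with an abbreviation list or a trained model.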


See also

* Multiword expression
* Punctuation
* Sentence extraction
* Sentence spacing
* Speech segmentation
* Syllabification
* Text segmentation
* Translation memory
* Word divider



External links


* pySBD - Python Sentence Boundary Disambiguation