Truecasing, also called capitalization recovery,
capitalization correction, or case restoration, is the problem in
natural language processing
Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
(NLP) of determining the proper
capitalization
Capitalization ( North American spelling; also British spelling in Oxford) or capitalisation (Commonwealth English; all other meanings) is writing a word with its first letter as a capital letter (uppercase letter) and the remaining letters in ...
of words where such information is unavailable. This commonly comes up due to the standard practice (in
English and many other languages) of automatically capitalizing the first word of a sentence. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase
text messages).
Truecasing is unnecessary in languages whose scripts do not have a distinction between uppercase and lowercase letters. This includes all languages not written in the
Latin
Latin ( or ) is a classical language belonging to the Italic languages, Italic branch of the Indo-European languages. Latin was originally spoken by the Latins (Italic tribe), Latins in Latium (now known as Lazio), the lower Tiber area aroun ...
,
Greek
Greek may refer to:
Anything of, from, or related to Greece, a country in Southern Europe:
*Greeks, an ethnic group
*Greek language, a branch of the Indo-European language family
**Proto-Greek language, the assumed last common ancestor of all kno ...
,
Cyrillic
The Cyrillic script ( ) is a writing system used for various languages across Eurasia. It is the designated national script in various Slavic, Turkic, Mongolic, Uralic, Caucasian and Iranic-speaking countries in Southeastern Europe, Ea ...
or
Armenian alphabet
The Armenian alphabet (, or , ) or, more broadly, the Armenian script, is an alphabetic writing system developed for Armenian and occasionally used to write other languages. It is one of the three historical alphabets of the South Caucasu ...
s, such as
Korean,
Japanese,
Chinese,
Thai,
Hebrew
Hebrew (; ''ʿÎbrit'') is a Northwest Semitic languages, Northwest Semitic language within the Afroasiatic languages, Afroasiatic language family. A regional dialect of the Canaanite languages, it was natively spoken by the Israelites and ...
,
Arabic
Arabic (, , or , ) is a Central Semitic languages, Central Semitic language of the Afroasiatic languages, Afroasiatic language family spoken primarily in the Arab world. The International Organization for Standardization (ISO) assigns lang ...
,
Hindi
Modern Standard Hindi (, ), commonly referred to as Hindi, is the Standard language, standardised variety of the Hindustani language written in the Devanagari script. It is an official language of India, official language of the Government ...
, and
Georgian.
Techniques
*
Neural networks
A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either Cell (biology), biological cells or signal pathways. While individual neurons are simple, many of them together in a netwo ...
that operate at the word level or the character level have been trained to recover capitalization with greater than 90% accuracy.
*
Sentence segmentation can be used to determine where sentences begin, to implement the rule that the first word of every sentence must be capitalized.
*
Part-of-speech tagging
In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text ( corpus) as corresponding to a particular part of speech, based on both its defini ...
can be used to identify
proper nouns (such as Africa, Jupiter, Sarah, or Amazon), which must be capitalized. In some cases, the same word can be used as different parts of speech, and is capitalized differently. For example, Xerox the company, as a noun, is capitalized, but to xerox a document, as a verb, is not capitalized. A xerox, as in the copy of a document, can be recognized by the presence of a
determiner
Determiner, also called determinative ( abbreviated ), is a term used in some models of grammatical description to describe a word or affix belonging to a class of noun modifiers. A determiner combines with a noun to express its reference. Examp ...
, which is not used for proper nouns.
*
Named entity recognition can be used to identify proper nouns, which must be capitalized.
* A
spell checker
In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic ...
can be used to identify words that are always capitalized.
Applications
Truecasing aids in other NLP tasks, such as
named entity recognition (NER),
automatic content extraction (ACE), and
machine translation
Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages.
Early approaches were mostly rule-based or statisti ...
.
Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use
statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.
See also
*
Sentence case
*
Title case
References
{{Natural Language Processing
Tasks of natural language processing