Noisy Text

Noisy text is text with differences between the surface form of a coded representation of the text and the intended, correct, or original text. The noise may be due to typographic errors or colloquialisms, which are always present in natural language, and it usually lowers the data quality in a way that makes the text less accessible to automated processing by computers, including natural language processing. The noise may also have been introduced through an extraction process (e.g., transcription or OCR) from media other than original electronic texts.

Language usage in computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language. The urge towards shorter messages, which allow faster typing, and the need for semantic clarity shape the structure of the text used in such discourse.

Various business analysts estimate that unstructured data constitutes around 80% of all enterprise data. A large proportion of this data comprises chat transcripts, emails and other informal and semi-formal internal and external communications. Such text is usually meant for human consumption, but given the amount of data, manual processing and evaluation of those resources is no longer practically feasible. This raises the need for robust text mining methods.
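
One way to make the "difference between the surface form and the intended text" concrete is to count character edits between the two. The sketch below is only an illustration and is not taken from the article's sources: it uses the Levenshtein edit distance, and the helper names `edit_distance` and `character_error_rate` are hypothetical. Dividing the distance by the length of the intended text gives the character error rate often reported when evaluating OCR or transcription output.

```python
def edit_distance(surface: str, intended: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning `surface` into `intended`."""
    # previous[j] holds the distance between the surface prefix processed
    # so far and the first j characters of the intended text.
    previous = list(range(len(intended) + 1))
    for i, s_char in enumerate(surface, start=1):
        current = [i]
        for j, t_char in enumerate(intended, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(surface: str, intended: str) -> float:
    """Edit distance normalised by the length of the intended text."""
    if not intended:
        # Assumption for this sketch: any surface text counts as fully noisy
        # when the intended text is empty.
        return 1.0 if surface else 0.0
    return edit_distance(surface, intended) / len(intended)
```

For example, `edit_distance("recieve", "receive")` is 2, because two characters have to be substituted. This plain variant does not treat a swap of adjacent letters as a single edit.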


Techniques for noise reduction

The use of spell checkers and grammar checkers can reduce the amount of noise in typed text. Many word processors include these features among their editing tools. Online, Google Search includes a search term suggestion engine to guide users when they make mistakes with their queries.
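
The idea behind a simple spell checker can be sketched in a few lines. This is a minimal illustration, not how any particular word processor or search engine works: it assumes a tiny hand-written vocabulary, and the names `VOCABULARY`, `suggest` and `normalize` are hypothetical. It relies on `difflib.get_close_matches` from the Python standard library to pick the closest known word.

```python
import difflib
import re

# Toy vocabulary, assumed for this sketch; a real checker would load a
# full dictionary or derive word frequencies from a corpus.
VOCABULARY = (
    "be", "due", "errors", "may", "noise", "text", "the", "to", "typographic",
)


def suggest(word: str, vocabulary=VOCABULARY, cutoff: float = 0.75) -> str:
    """Return the closest vocabulary word, or `word` itself if it is
    already known or nothing is similar enough."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def normalize(text: str) -> str:
    """Replace each alphabetic token in `text` with its suggestion."""
    return re.sub(r"[A-Za-z]+", lambda m: suggest(m.group(0)), text)


print(normalize("The noize may be due to typografic errors"))
```

The `cutoff` value is a design choice: lowering it corrects more misspellings but also raises the risk of replacing a valid word that is simply missing from the vocabulary, such as a name or a piece of jargon.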


See also

* Data corruption
* Jargon
* Leet speak
* Natural language understanding
* Noisy channel

