Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is only used when specific metadata, such as an HTTP header, is either not available or is assumed to be untrustworthy. The algorithm usually involves statistical analysis of byte patterns; such statistical analysis can also be used to perform language detection. The process is not foolproof because it depends on statistical data. In general, incorrect charset detection leads to mojibake, the garbled text that results from decoding with an unintended character encoding.
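As a rough illustration of such statistical guessing, the sketch below uses the third-party Python library ''chardet'' (a port of the Mozilla detectors listed under external links). The reported encoding is only a guess with an attached confidence score; on short or ambiguous input it may be wrong, low-confidence, or undecided.

```python
# Illustrative only: heuristic detection with the third-party "chardet"
# library. The result is a statistical guess; on short or ambiguous input
# the encoding may be reported as None or with low confidence.
import chardet

sample = "München ist eine Stadt in Deutschland. " * 5

for encoding in ("utf-8", "iso-8859-1"):
    raw = sample.encode(encoding)
    guess = chardet.detect(raw)   # dict with 'encoding', 'confidence', 'language'
    print(encoding, "->", guess["encoding"], guess["confidence"])
```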
One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8: in a ''random'' byte string, a byte with the high bit set has only a 1/15 chance of starting a valid UTF-8 code point, and the odds are even lower in actual text, which is not random but tends to contain isolated bytes with the high bit set that are always invalid in UTF-8. As a result, text in any other encoding that uses bytes with the high bit set is ''extremely'' unlikely to pass a UTF-8 validity test. However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding. For example, websites in UTF-8 containing the name of the German city München may display "MÃ¼nchen", due to the code deciding that the encoding was ISO-8859-1 or Windows-1252 before (or without) even testing to see if it was UTF-8.
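A minimal sketch of the UTF-8 test described above, using Python's built-in decoder as the validity check (the function name is just for illustration):

```python
# A sketch of the "try UTF-8 first" rule: text in another 8-bit encoding
# that uses bytes with the high bit set will almost never decode as UTF-8.
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_bytes   = "München".encode("utf-8")       # b'M\xc3\xbcnchen'
latin1_bytes = "München".encode("iso-8859-1")  # b'M\xfcnchen'

print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False: 0xFC never appears in valid UTF-8

# Note: passing the test does not prove the text was meant as UTF-8;
# plain ASCII, for instance, is valid UTF-8 and valid in many legacy encodings.
```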
UTF-16 is fairly reliable to detect due to the high number of newlines (U+000A) and spaces (U+0020) that should be found when dividing the data into 16-bit words, and the large number of NUL bytes found at all even or all odd locations. Common characters ''must'' be checked for; relying only on a test that the text is valid UTF-16 fails: the Windows operating system would mis-detect the phrase "Bush hid the facts" (without a newline) in ASCII as Chinese UTF-16LE, since all the byte pairs matched assigned Unicode characters in UTF-16LE.
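The following hypothetical sketch combines the UTF-16 clues mentioned above: the position of NUL bytes and the presence of common characters such as spaces or newlines when the data is read as 16-bit words. The 30% threshold is an arbitrary assumption, not a standard value; requiring common characters is what keeps the "Bush hid the facts" bytes from being taken for UTF-16LE.

```python
# A rough sketch of the UTF-16 heuristics described above (the thresholds
# are illustrative; real detectors weigh more evidence than this).
def guess_utf16_byte_order(data: bytes):
    """Return 'utf-16-be', 'utf-16-le', or None, based on where NUL bytes
    fall and whether common characters appear when read as 16-bit words."""
    if len(data) < 2:
        return None
    even_nuls = sum(1 for i in range(0, len(data) - 1, 2) if data[i] == 0)
    odd_nuls  = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
    half = len(data) // 2
    # Mostly-ASCII text in UTF-16BE puts the NUL high byte first (even offsets);
    # in UTF-16LE the NUL high byte comes second (odd offsets).
    if even_nuls > half * 0.3 and odd_nuls == 0:
        candidate = "utf-16-be"
    elif odd_nuls > half * 0.3 and even_nuls == 0:
        candidate = "utf-16-le"
    else:
        return None
    # Require common characters: spaces (U+0020) or newlines (U+000A).
    text = data.decode(candidate, errors="ignore")
    return candidate if (" " in text or "\n" in text) else None

print(guess_utf16_byte_order("Bush hid the facts".encode("utf-16-le")))  # utf-16-le
# The plain ASCII bytes decode as "valid" UTF-16LE (nine CJK characters),
# but with no NUL bytes and no space/newline words the heuristic rejects it.
print(guess_utf16_byte_order(b"Bush hid the facts"))                     # None
```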
Charset detection is particularly unreliable in Europe, in an environment of mixed ISO-8859 encodings. These are closely related eight-bit encodings that share an overlap in their lower half with ASCII, and in which all arrangements of bytes are valid. There is no technical way to tell these encodings apart; recognizing them relies on identifying language features, such as letter frequencies or spellings.
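A deliberately crude sketch of such language-feature scoring: decode the bytes under each candidate ISO-8859 variant and count letters that are plausible for the corresponding language. The letter sets below are illustrative assumptions; real detectors use much richer statistical models.

```python
# Very rough language-feature scoring between two legacy encodings
# (ISO-8859-1 vs ISO-8859-5). The "expected letters" sets are illustrative
# assumptions, not a real frequency model.
EXPECTED = {
    "iso-8859-1": set("äöüßàéèçñ"),           # letters common in Western European text
    "iso-8859-5": set("абвгдежзиклмнопрст"),  # letters common in Cyrillic text
}

def score(data: bytes, encoding: str) -> int:
    """Count decoded characters that look plausible for the given encoding."""
    text = data.decode(encoding, errors="ignore").lower()
    return sum(1 for ch in text if ch in EXPECTED[encoding])

def pick_encoding(data: bytes) -> str:
    return max(EXPECTED, key=lambda enc: score(data, enc))

raw = "Привет, мир".encode("iso-8859-5")
print(pick_encoding(raw))   # iso-8859-5: the Cyrillic letters score higher
```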
Due to the unreliability of heuristic detection, it is better to properly label datasets with the correct encoding (see Character encodings in HTML#Specifying the document's character encoding). Even though UTF-8 and UTF-16 are easy to detect, some systems require UTF encodings to explicitly label the document with a prefixed byte order mark (BOM).
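When a document is labelled with a byte order mark, no guessing is needed. The sketch below checks for the standard BOM prefixes (constants from Python's ''codecs'' module) before any heuristic would run; the four-byte UTF-32 marks are checked before the two-byte UTF-16 marks because the UTF-16LE BOM is a prefix of the UTF-32LE BOM.

```python
# A sketch of explicit labelling: look for a byte order mark (BOM) prefix
# before falling back to heuristic detection.
import codecs

BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # check 4-byte BOMs before UTF-16
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def encoding_from_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(encoding_from_bom(codecs.BOM_UTF8 + "München".encode("utf-8")))  # utf-8-sig
print(encoding_from_bom(b"plain ASCII text"))                          # None
```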


See also

* International Components for Unicode – a library that can perform charset detection
* Language identification
* Content sniffing
* Browser sniffing – a similar heuristic technique for determining the capabilities of a web browser, before serving content to it


External links


* IMultiLanguage2::DetectInputCodepage
* Reference for ''cpdetector'' charset detection
* Java port of Mozilla Charset Detectors
* Delphi/Pascal port of Mozilla Charset Detectors
* ''uchardet'', C++ fork of Mozilla Charset Detectors; includes Bash command-line tool
* C# port of Mozilla Charset Detectors
* HEBCI, a technique for detecting the character set used in form submissions
* Frequency distributions of English trigraphs