A number of text encodings have historically been used for storing text on the World Wide Web, though by now UTF-8 is dominant, with all languages at 95% use or higher by some estimates. The same encodings are used in local files (or databases), along with many more, at least historically. Measuring the exact prevalence of each is not possible for privacy reasons (local files, for example, are not web accessible), but fairly accurate estimates are available for public web sites, and those statistics may (or may not) accurately reflect use in local files. Attempts at measuring encoding popularity may use counts of (web) documents, or counts weighted by the actual use or visibility of those documents.
The decision to use any one encoding may depend on the language used for the documents, the locale that is the source of the document, or the purpose of the document. Text may be ambiguous as to what encoding it is in; for instance, pure ASCII text is equally valid as ASCII, ISO-8859-1, CP1252, or UTF-8.
Tags may indicate a document's encoding, but an incorrect tag may be silently corrected by display software (for instance, the HTML specification says that the tag for ISO-8859-1 should be treated as CP1252), so counts of tags may not be accurate.
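The ambiguity and the silent correction described above can be sketched in a few lines of Python. This is a minimal illustration, not browser code: the helper name decode_like_a_browser and its one-entry override table are hypothetical, and only the ISO-8859-1 to CP1252 substitution mentioned in the text is modelled.

```python
# Pure ASCII bytes decode to the same string under every ASCII-compatible encoding,
# so the bytes alone cannot reveal which encoding the author intended.
ascii_bytes = b"Hello, world!"
decodings = {
    enc: ascii_bytes.decode(enc)
    for enc in ("ascii", "iso-8859-1", "cp1252", "utf-8")
}
all_identical = len(set(decodings.values())) == 1

# Hypothetical, simplified override table: only the substitution named in the text.
LABEL_OVERRIDES = {"iso-8859-1": "cp1252"}


def decode_like_a_browser(data: bytes, declared_label: str) -> str:
    """Decode data using the declared label, applying the override table first."""
    label = declared_label.lower()
    return data.decode(LABEL_OVERRIDES.get(label, label))


# Byte 0x80 is where the two encodings differ:
# ISO-8859-1 maps it to a C1 control character, CP1252 maps it to the euro sign.
strict_latin1 = b"\x80".decode("iso-8859-1")
browser_view = decode_like_a_browser(b"\x80", "ISO-8859-1")
```

Because the declared label and the encoding actually applied can differ in this way, a count of declared labels is only an approximation of real usage.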
Popularity on the World Wide Web
UTF-8 has been the most common encoding for the World Wide Web since 2008. UTF-8 is used by 98.6% of surveyed web sites (and 99.2% of the top 100,000 pages); the next-most popular encoding, ISO-8859-1, is used by 1.1% (and only 15 of the top 1,000 pages).
Although many pages use only ASCII characters to display content, very few websites now declare their encoding as ASCII rather than UTF-8.
All countries (and over 97% of all tracked languages) have at least 96% use of the UTF-8 encoding on the web. See below for the major alternative encodings:
The second-most popular encoding varies depending on locale, and is typically more efficient for the associated language. One such encoding is the Chinese GB 18030 standard, which is a full Unicode Transformation Format. Even so, 96% of websites in China and its territories use UTF-8, with GB 18030 (effectively) the next most popular encoding.
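What "a full Unicode Transformation Format" means in practice can be sketched with Python's built-in codecs: text mixing several scripts survives a round trip through GB 18030, whereas a legacy encoding with a limited repertoire rejects it. The helper can_encode and the sample string are illustrative assumptions, not taken from any survey.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Chinese, Hebrew, and an emoji outside the Basic Multilingual Plane.
mixed = "中文 שלום \U0001F600"

# GB 18030 covers all of Unicode, so the round trip is lossless.
round_trip = mixed.encode("gb18030").decode("gb18030")

# Big5 has no Hebrew letters or emoji, so encoding the same text fails.
big5_ok = can_encode(mixed, "big5")
```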
Big5 is another popular non-UTF encoding, meant for traditional Chinese characters (though GB 18030, being a full UTF, works for those too). It is the next-most popular encoding in Taiwan, after UTF-8 at 96.8%, and is also the second-most used in Hong Kong, where UTF-8 is even more dominant at 98.4%. The single-byte Windows-1251 is twice as efficient as UTF-8 for the Cyrillic script, yet 96.1% of Russian websites still use UTF-8 (the single-byte Greek and Hebrew encodings are likewise twice as efficient, and UTF-8 has over 99% use for those languages). Korean-, Chinese-, and Japanese-language websites also have relatively high non-UTF-8 use compared to most other countries. Japanese UTF-8 use is at 98.8%; the rest use the legacy EUC-JP and/or Shift JIS (actually decoded as its superset Windows-31J) encodings, which are used about equally.
South Korea has 95% UTF-8 use, with the rest of its websites mainly using EUC-KR, which is more efficient for Korean text.
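The efficiency claims above come down to bytes per character, which can be checked with a short sketch. The sample words and the helper encoded_size are illustrative assumptions; the codec names are the ones Python ships with.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


# (label, sample text, legacy encoding for that script)
samples = [
    ("Russian", "Привет", "cp1251"),
    ("Greek", "Ελλάδα", "iso-8859-7"),
    ("Korean", "한국어", "euc_kr"),
]

# Map each label to (UTF-8 size, legacy size) in bytes.
sizes = {
    label: (encoded_size(text, "utf-8"), encoded_size(text, legacy))
    for label, text, legacy in samples
}

for label, (utf8_bytes, legacy_bytes) in sizes.items():
    print(f"{label}: UTF-8 {utf8_bytes} bytes, legacy {legacy_bytes} bytes")
```

For these samples, Cyrillic and Greek letters take two bytes each in UTF-8 against one in the single-byte encodings, while Hangul syllables take three bytes in UTF-8 against two in EUC-KR. Real pages also contain ASCII markup, which costs one byte in all of these encodings, so whole-document savings are smaller than the per-character ratio suggests.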
Popularity for local text files
Local storage on computers has considerably more use of "legacy" single-byte encodings than the web does. Attempts to update to UTF-8 have been blocked by editors that do not display or write UTF-8 unless the first character in a file is a byte order mark (BOM), making it impossible for other software to use UTF-8 without being rewritten to ignore the byte order mark on input and add it on output. UTF-16 files are also fairly common on Windows, but not on other systems.
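The "ignore on input, add on output" behaviour described above is what Python's utf-8-sig codec provides. The sketch below uses an arbitrary sample string and shows what happens when BOM-prefixed bytes are read by software that is unaware of the mark.

```python
import codecs

text = "naïve café"

plain = text.encode("utf-8")         # no byte order mark
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# The BOM-aware codec strips the mark if present and tolerates its absence.
read_bom_aware = with_bom.decode("utf-8-sig")
read_plain_bom_aware = plain.decode("utf-8-sig")

# A plain UTF-8 reader keeps the mark as a stray U+FEFF at the start of the text.
read_bom_unaware = with_bom.decode("utf-8")
```

The stray U+FEFF in the last case is invisible in most displays but can break string comparisons, parsers, and scripts that expect the file to begin with specific characters.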
Popularity internally in software
In the memory of a computer program, usage of UTF-16 is very common, particularly in Windows but also in cross-platform languages and libraries such as JavaScript, Python, and Qt. Compatibility with the Windows API is a major reason for this. Non-Windows libraries written in the early days of Unicode also tend to use UTF-16, such as International Components for Unicode.
At one time it was believed by many (and is still believed today by some) that having fixed-size code units offers computational advantages, which led many systems, in particular Windows, to use the fixed-size UCS-2 with two bytes per character. This is false: strings are almost never randomly accessed, and sequential access is the same speed in both variable- and fixed-size encodings. In addition, even UCS-2 was not "fixed size" if combining characters are considered, and when Unicode exceeded 65,536 code points it had to be replaced with the non-fixed-size UTF-16 anyway.
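Both reasons why two-byte units are not "fixed size" can be seen in a small sketch. The helper utf16_code_units is an illustrative name; it counts 16-bit units by encoding to UTF-16 without a byte order mark.

```python
import unicodedata


def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# A character inside the first 65,536 code points fits in one unit...
units_euro = utf16_code_units("\u20ac")       # EURO SIGN
# ...but one beyond that range needs a surrogate pair, i.e. two units.
units_emoji = utf16_code_units("\U0001F600")  # GRINNING FACE

# Combining characters: one perceived character, two code points.
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
```

Here decomposed and composed render identically, yet the first contains two code points and the second one, so even a fixed-width encoding would not give one unit per displayed character.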
Recently it has become clear that the overhead of translating from/to UTF-8 on input and output, and of dealing with potential encoding errors in the input UTF-8, overwhelms any benefits UTF-16 could offer. So newer software systems are starting to use UTF-8. The default string primitive in newer programming languages, such as Go, Julia, Rust, and Swift 5, assumes UTF-8 encoding. PyPy also uses UTF-8 for its strings, and Python is looking into storing all strings in UTF-8. Microsoft now recommends the use of UTF-8 for applications using the Windows API, while continuing to maintain a legacy "Unicode" (meaning UTF-16) interface.
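The "potential encoding errors in the input UTF-8" mentioned above are something any program converting at its boundary has to decide how to handle. The sketch below shows two common policies available in Python; the helper decode_utf8_lenient and the sample bytes are illustrative assumptions.

```python
def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for any invalid byte sequence."""
    return data.decode("utf-8", errors="replace")


# "café" stored as ISO-8859-1: the final byte 0xE9 is not valid UTF-8 here.
mislabelled = b"caf\xe9"

# Policy 1: strict decoding rejects the input outright.
try:
    mislabelled.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Policy 2: lenient decoding keeps going and marks the damage.
lenient = decode_utf8_lenient(mislabelled)
```

A program that stores strings as UTF-8 internally can defer this decision, passing the bytes through unchanged and validating only where the text is actually interpreted.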