Unicode, formally The Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, which is maintained by the Unicode Consortium, defines as of the current version (15.0) 149,186 characters covering 161 modern and historic scripts, as well as symbols, emoji (including in colors), and non-visual control and formatting codes.

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, and most modern programming languages.

The Unicode character repertoire is synchronized with ISO/IEC 10646, each being code-for-code identical with the other. The Unicode Standard, however, includes more than just the base code. Alongside the character encodings, the Consortium's official publication includes a wide variety of details about the scripts and how to display them: normalization rules, decomposition, collation, rendering, bidirectional text display order for multilingual texts, and so on. The Standard also includes reference data files and visual charts to help developers and designers correctly implement the repertoire.

Unicode can be stored using several different encodings, which translate the character codes into sequences of bytes. The Unicode Standard defines three encodings, and several others exist; in practice all are variable-length. The most common encodings are the ASCII-compatible UTF-8, the ASCII-incompatible UTF-16 (compatible with the obsolete UCS-2), and the Chinese Unicode encoding standard GB18030, which is not an official Unicode standard but is used in China and implements Unicode fully.
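The way these encodings turn the same code points into different byte sequences can be sketched in a few lines of Python. This is only an illustration: the sample characters and the helper name `encoded_lengths` are choices made for this example, and the codecs are the ones shipped with the Python standard library.

```python
# Encode a few sample characters with three encodings and report
# how many bytes each character needs in each of them.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

def encoded_lengths(ch):
    """Return the byte length of `ch` in UTF-8, UTF-16 and GB18030."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        "gb18030": len(ch.encode("gb18030")),
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# ASCII compatibility: UTF-8 leaves ASCII bytes unchanged, UTF-16 does not.
print("A".encode("utf-8") == b"A")
print("A".encode("utf-16-be") == b"\x00A")
```

The byte counts differ per character and per encoding, which is what "variable-length" means here; a character outside the 16-bit range, such as the emoji, needs four bytes in each of the three.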


Origin and development

Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO/IEC 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using Latin characters and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).

Unicode, in intent, encodes the underlying characters (graphemes and grapheme-like units) rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).

In text processing, Unicode takes the role of providing a unique code point (a number, not a glyph) for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor. This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode.

The first 256 code points were made identical to the content of ISO/IEC 8859-1 so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "fullwidth forms" section of code points encompasses a full duplicate of the Latin alphabet because Chinese, Japanese, and Korean (CJK) fonts contain two versions of these letters: "fullwidth", matching the width of the CJK characters, and normal width. For other examples, see duplicate characters in Unicode.

Unicode Bulldog Award recipients include many names influential in the development of Unicode, among them Tatsuo Kobayashi, Thomas Milo, Roozbeh Pournader, Ken Lunde, and Michael Everson.
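The design points described in this section (a character is a number rather than a glyph, the first 256 code points match ISO/IEC 8859-1, and legacy duplicates such as the fullwidth Latin letters exist) can be checked with a short Python sketch. The helper names are hypothetical, and the example assumes Python's built-in `latin-1` codec as a stand-in for ISO/IEC 8859-1 and compatibility normalization (NFKC) as one way to fold duplicates; NFKC also changes many other compatibility characters, so it is not a fullwidth-only operation.

```python
import unicodedata

# A character is identified by a number (its code point), not by a glyph.
print(ord("A") == 0x41 and chr(0x41) == "A")

def latin1_matches_unicode():
    """Check that every latin-1 byte decodes to the code point with the same value."""
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))

def fold_fullwidth(text):
    """Fold compatibility duplicates (e.g. fullwidth Latin letters) using NFKC."""
    return unicodedata.normalize("NFKC", text)

fullwidth_a = "\uff21"  # a separate code point from the ordinary letter A
print(latin1_matches_unicode())
print(unicodedata.name(fullwidth_a))
print(fullwidth_a == "A", fold_fullwidth(fullwidth_a) == "A")
```

Because the duplicate has its own code point, text converted from a legacy CJK encoding can be converted back without loss; folding is an explicit, separate step that an application chooses to apply.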


History

Based on experiences with the Xerox Character Code Standard (XCCS) since 1980, the origins of Unicode can be traced back to 1987, when Joe Becker from Xerox, with Lee Collins and Mark Davis from Apple, started investigating the practicalities of creating a universal character set. With additional input from Peter Fenwick and Dave Opstad, Joe Becker published a draft proposal for an "international/multilingual text character encoding system" in August 1988, tentatively called Unicode. He explained that "the name 'Unicode' is intended to suggest a unique, unified, universal encoding". In this document, entitled Unicode 88, Becker outlined a 16-bit character model:

Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.

His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:

Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.

In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of RLG, and Glenn Wright of Sun Microsystems, and in 1990, Michel Suignard and Asmus Freytag from Microsoft and Rick McGowan of NeXT joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready.

The Unicode Consortium was incorporated in California on 3 January 1991, and in October 1991, the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992.

In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g., Egyptian hieroglyphs) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Among the characters not originally intended for Unicode are rarely used Kanji or Chinese characters, many of which are part of personal and place names, making them much more essential than envisioned in the original architecture of Unicode.

The Microsoft TrueType specification version 1.0 from 1992 used the name 'Apple Unicode' instead of 'Unicode' for the Platform ID in the naming table.
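The arithmetic behind the move from a 16-bit design to a codespace of over a million code points can be worked through in Python. The sketch below uses the surrogate-pair calculation as UTF-16 defines it; the function name `to_surrogates` is made up for this example, and the Egyptian hieroglyph at U+13000 is used only as a convenient sample from one of the historic scripts mentioned above.

```python
def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

# Becker's 1988 upper bound for modern-use characters.
print(2 ** 14)
# Size of the original 16-bit design.
print(2 ** 16)
# Codespace reachable once surrogates exist: 17 planes of 65,536 code points.
print(17 * 2 ** 16)

high, low = to_surrogates(0x13000)  # first Egyptian hieroglyph code point
print(f"U+13000 -> {high:04X} {low:04X}")
```

Two 16-bit units carry 10 bits each, which adds 2^20 code points on top of the original 2^16, giving the 1,114,112 total.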


Unicode Consortium

The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including Adobe, Apple, Facebook, Google, IBM, Microsoft, Netflix, and SAP SE.

Over the years several countries or government agencies have been members of the Unicode Consortium. Of these, presently only the Ministry of Endowments and Religious Affairs (Oman) is a full member with voting rights.

The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.


Scripts covered

Unicode currently covers most major writing systems in use today. As of version 15.0, a total of 161 scripts are included in Unicode (covering alphabets, abugidas and syllabaries), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and music (in the form of notes and rhythmic symbols), also occur.

The Unicode Roadmap Committee (Michael Everson, Rick McGowan, Ken Whistler, V.S. Umamaheswaran) maintains the list of scripts that are candidates or potential candidates for encoding, along with their tentative code block assignments, on the Unicode Roadmap page of the Unicode Consortium website. For some scripts on the Roadmap, such as Jurchen and Khitan small script, encoding proposals have been made and are working their way through the approval process. For other scripts, such as Mayan (besides numbers) and Rongorongo, no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.

Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon) are listed in the ConScript Unicode Registry, along with unofficial but widely used Private Use Area code assignments. There is also a Medieval Unicode Font Initiative focused on special Latin medieval characters. Some of these proposals have already been incorporated into Unicode.
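Private Use Area code points, which registries like the one mentioned above rely on, are visible to software only as "private use": the standard assigns them no name or meaning. A small Python sketch, assuming the `unicodedata` module bundled with the interpreter (whose data reflects whichever Unicode version that interpreter ships) and a hypothetical helper `describe`:

```python
import unicodedata

def describe(code_point):
    """Summarize what the bundled Unicode data says about one code point."""
    ch = chr(code_point)
    category = unicodedata.category(ch)        # "Co" means private use
    name = unicodedata.name(ch, "<no name>")   # private-use code points have no name
    return f"U+{code_point:04X} category={category} name={name}"

print(describe(0x0041))    # an ordinary encoded letter
print(describe(0xE000))    # first code point of the BMP Private Use Area
print(describe(0xF0000))   # first code point of a supplementary private use plane
```

Any meaning attached to such a code point comes from a private agreement (for example a font and a registry entry), not from the standard, so two systems may interpret the same private-use value differently.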


Script Encoding Initiative

The Script Encoding Initiative, a project run by Deborah Anderson at the University of California, Berkeley, was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. The project has become a major source of proposed additions to the standard in recent years.


Versions

The Unicode Consortium and the International Organization for Standardization (ISO) have together developed a shared repertoire following the initial publication of The Unicode Standard in 1991; Unicode and the ISO's Universal Coded Character Set