
In Unicode, a Private Use Area (PUA) is a range of code points that, by definition, will not be assigned characters by the Unicode Consortium. Three private use areas are defined: one in the Basic Multilingual Plane (U+E000–U+F8FF), and one each in, and nearly covering, planes 15 and 16 (U+F0000–U+FFFFD, U+100000–U+10FFFD). The code points in these areas cannot be considered as standardized characters in Unicode itself. They are intentionally left undefined so that third parties may define their own characters without conflicting with Unicode Consortium assignments. Under the Unicode Stability Policy, the Private Use Areas will remain allocated for that purpose in all future Unicode versions.

Assignments to Private Use Area characters need not be private in the sense of strictly internal to an organisation; a number of assignment schemes have been published by several organisations. Such publication may include a font that supports the definition (showing the glyphs), and software making use of the private-use characters (e.g. a graphics character for a "print document" function). By definition, multiple private parties may assign different characters to the same code point, with the consequence that a user may see one private character from an installed font where a different one was intended.


Definition

Under the Unicode definition, code points in the Private Use Areas are assigned characters; they are not noncharacters, reserved, or unassigned. Their category is "Other, private use (Co)", and no character names are specified. No representative glyphs are provided, and character semantics are left to private agreement.
Private-use characters are assigned Unicode code points whose interpretation is not specified by this standard and whose use may be determined by private agreement among cooperating users. These characters are designated for private use and do not have defined, interpretable semantics except by private agreement. ... No charts are provided for private-use characters, as any such characters are, by their very nature, defined only outside the context of this standard.
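
The properties described above can be inspected with Python's standard unicodedata module. The following is a minimal sketch; the helper name describe and the "<no name>" placeholder are illustrative choices, not part of any standard.

```python
import unicodedata


def describe(cp: int) -> tuple[str, str]:
    """Return (general category, character name or a placeholder) for a code point."""
    ch = chr(cp)
    # unicodedata.name() raises ValueError for nameless code points unless
    # a default is supplied; private-use characters have no names.
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")


print(describe(0xE000))   # ('Co', '<no name>'): private use
print(describe(0xFFFFE))  # ('Cn', '<no name>'): noncharacter, not private use
print(describe(0x0041))   # ('Lu', 'LATIN CAPITAL LETTER A')
```

The category "Co" distinguishes private-use code points from the noncharacters at the end of each plane, which report "Cn".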


Assignment

In the Basic Multilingual Plane (plane 0), the block titled Private Use Area has 6400 code points. Planes 15 and 16 are almost entirely assigned to two further Private Use Areas, Supplementary Private Use Area-A and Supplementary Private Use Area-B respectively. (The last two code points of every plane are defined to be noncharacters; the remaining 65,534 code points of each of planes 15 and 16 are assigned as private-use characters.) In UTF-16, a subset of the high surrogates (U+DB80..U+DBFF) is used for these and only these planes; these are called High Private Use Surrogates.
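
The arithmetic behind these figures can be checked with a short sketch. The function names pua_block and utf16_surrogates are hypothetical helpers written for this illustration; the surrogate computation follows the standard UTF-16 encoding of supplementary code points.

```python
# Inclusive code point ranges of the three Private Use Areas.
PUA_RANGES = {
    "Private Use Area": (0xE000, 0xF8FF),
    "Supplementary Private Use Area-A": (0xF0000, 0xFFFFD),
    "Supplementary Private Use Area-B": (0x100000, 0x10FFFD),
}


def pua_block(cp: int):
    """Return the name of the PUA block containing cp, or None."""
    for name, (lo, hi) in PUA_RANGES.items():
        if lo <= cp <= hi:
            return name
    return None


def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Block sizes: 6400, 65534 and 65534 code points.
sizes = {name: hi - lo + 1 for name, (lo, hi) in PUA_RANGES.items()}

# The first code point of plane 15 and the last private-use code point of
# plane 16 bracket the High Private Use Surrogates U+DB80..U+DBFF.
first_high, _ = utf16_surrogates(0xF0000)
last_high, _ = utf16_surrogates(0x10FFFD)
```

The last code point of plane 14, U+EFFFF, yields the high surrogate U+DB7F, which is why the range starting at U+DB80 serves planes 15 and 16 and no others.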


Unicode PUA blocks

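There are three PUA blocks in Unicode:
* Private Use Area (U+E000–U+F8FF), 6,400 code points
* Supplementary Private Use Area-A (U+F0000–U+FFFFD), 65,534 code points
* Supplementary Private Use Area-B (U+100000–U+10FFFD), 65,534 code points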


History

In Unicode 1.0.0, the private use area extended from U+E800 to U+FDFF (i.e. it did not include U+E000..E7FF, but additionally included the U+F900..FDFF range now occupied by CJK Compatibility Ideographs, Alphabetic Presentation Forms and Arabic Presentation Forms-A). This was changed to U+E000..F8FF in Unicode 1.0.1, and remained so in Unicode 1.1. Contrary to a common misconception, the range U+D800..DFFF (reserved for UTF-16 surrogates since Unicode 2.0) was not included in the private use range of any Unicode 1.x version.

Historically, planes E0 (224) through FF (255), and groups 60 (96) through 7F (127) of the Universal Coded Character Set (i.e. U+E00000 through U+FFFFFF and U+60000000 through U+7FFFFFFF) were also designated as private use. These ranges were removed from the specified private-use ranges when the UCS was restricted to the seventeen planes reachable in UTF-16.


Usage


Standardization initiative uses

Many people and institutions have created character collections for the PUA. Some of these private use agreements are published, so other PUA implementers can aim for unused or less used code points to prevent overlaps. Several characters and scripts previously encoded in private use agreements have since been fully encoded in Unicode, necessitating mappings from the PUA to other Unicode code points.

One of the better-known and more broadly implemented PUA agreements is maintained by the ConScript Unicode Registry (CSUR). The CSUR, which is not officially endorsed by or associated with the Unicode Consortium, provides a mapping for constructed scripts, such as Klingon pIqaD and Ferengi script (Star Trek), Tengwar and Cirth (J.R.R. Tolkien's cursive and runic scripts), Alexander Melville Bell's Visible Speech, and Dr. Seuss' alphabet from On Beyond Zebra. The CSUR previously encoded the undeciphered Phaistos characters, as well as the Shavian and Deseret alphabets, which have all been accepted for official encoding in Unicode.

Another common PUA agreement is maintained by the Medieval Unicode Font Initiative (MUFI). This project is attempting to support all of the scribal abbreviations, ligatures, precomposed characters, symbols, and alternate letterforms found in medieval texts written in the Latin alphabet. The express purpose of MUFI is to experimentally determine which characters are necessary to represent these texts, and to have those characters officially encoded in Unicode. As of Unicode version 5.1, 152 MUFI characters have been incorporated into the official Unicode encoding.

Some agreed-upon PUA character collections exist in part or whole because the Unicode Consortium is in no hurry to encode them. Some, such as unrepresented languages, are likely to end up encoded in the future. Some unusual cases such as fictional languages are outside the usual scope of Unicode but not explicitly ruled out by the principles of Unicode, and may show up eventually (such as the Star Trek and Tolkien writing systems). In other cases, the proposed encoding violates one or more Unicode principles and hence is unlikely ever to be officially recognized by Unicode. This mostly applies where users want to directly encode alternate forms, ligatures, or base-character-plus-diacritic combinations (such as the TUNE scheme).

* Emoji is an encoding for picture characters or emoticons used in Japanese wireless messages and webpages. With Unicode 6.0 and later, many of these have been encoded in the block Miscellaneous Symbols and Pictographs and elsewhere in the SMP.
* GB/T 20542-2006 ("Tibetan Coded Character Set Extension A") and GB/T 22238-2008 ("Tibetan Coded Character Set Extension B") are Chinese national standards that use the PUA to encode precomposed Tibetan ligatures.
* GB 18030 and GBK use the PUA to provisionally encode characters not found in Unicode standards at the time of publication (most have been encoded since then).
* The Institute of the Estonian Language uses the PUA to encode Latin and Cyrillic precomposed characters that have no Unicode encoding.
* The Free Tengwar Font Project uses a mapping different from that of the ConScript Unicode Registry; it largely follows Michael Everson's 2001-03-07 Tengwar discussion paper, but diverges in some details.
* The MARC 21 standard uses the PUA to encode East Asian characters present in MARC-8 that have no Unicode encoding.
* The SIL Corporate PUA uses the PUA to encode characters used in minority languages that have not yet been accepted into Unicode.
* The STIX Fonts project uses the PUA to provide a comprehensive font set of mathematical symbols and alphabets, many of which are also available in the SMP now, e.g. in the Mathematical Alphanumeric Symbols block.
* The Tamil Unicode New Encoding (TUNE) is a proposed scheme for encoding Tamil that overcomes perceived deficiencies in the current Unicode encoding.
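
When characters from a private agreement are later given official code points, existing text has to be remapped. The following is a minimal sketch of such a migration using str.translate. The table PUA_TO_STANDARD and both function names are hypothetical; the two entries are the WGL4 ligature duplicates described under Vendor use, chosen only because their targets are stated in this article. A real migration table would come from the publisher of the agreement in question.

```python
import unicodedata

# Hypothetical migration table: private-use code point -> standard code point.
PUA_TO_STANDARD = {
    0xF001: 0xFB01,  # duplicate of LATIN SMALL LIGATURE FI
    0xF002: 0xFB02,  # duplicate of LATIN SMALL LIGATURE FL
}


def migrate(text: str, table: dict = PUA_TO_STANDARD) -> str:
    """Replace private-use characters that have a known standard equivalent."""
    return text.translate(table)


def remaining_private_use(text: str) -> list:
    """List private-use characters still present, for manual review."""
    return [f"U+{ord(c):04X}" for c in text if unicodedata.category(c) == "Co"]


sample = "\uf001sh and \ue000"
converted = migrate(sample)
leftover = remaining_private_use(converted)  # code points with no mapping
```

Reporting the leftover private-use characters matters because, without knowing which agreement produced the text, there is no way to tell what an unmapped code point was meant to represent.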


Vendor use

Informally, the range U+F000 through U+F8FF is known as the Corporate Use Area. This originates from early versions of Unicode, which defined an "End User Zone" extending from U+E000 upward and a "Corporate Use Zone" extending from U+F8FF downward, with the boundary between the two left undefined.

* The Adobe Glyph List formerly used the PUA for some of its glyphs.
* Apple lists a range of 1,280 characters in its developer documentation, U+F400–U+F8FF within the PUA, for Apple's use. Of those, only 311 are used, in the range U+F700–U+F8FF (by NeXT (NeXTSTEP and OPENSTEP) and Apple (Mac OS X AppKit)).
** One of these is U+F8FF, the Apple logo, generally supported by Apple's 8-bit sets.
* WGL4 uses the PUA (U+F001 and U+F002) to encode duplicates of the ligatures ﬁ (U+FB01) and ﬂ (U+FB02).
* Microsoft's defunct Services For Macintosh feature used U+F001 through U+F029 as replacements for special characters allowed in HFS but forbidden in NTFS, and U+F02A for the Apple logo.
* In old versions of its RichEdit component, Microsoft mapped U+F020–U+F0FF within the PUA to symbol fonts. For any character in this range, RichEdit would show a character from a symbol font instead of the end-user-defined character (EUDC).
* Some software uses U+F8FC–U+F8FE for ⌀ (diameter sign), ± (plus–minus sign) and ° (degree sign) respectively.
* Some fonts place the Windows logo key at U+F000.
* In some video games, such as Agar.io, U+F000 is used for a numeral succession starting at 13 or 18.
* On Ubuntu, U+E0FF is displayed as the "Circle Of Friends" logo and U+F200 is "ubuntu" in the Ubuntu typeface with a superscripted "Circle Of Friends" (this itself is U+F0FF).
* The 3270 font includes the Debian logo at U+F100.
* In the Linux Libertine font, U+E000 displays Tux, the mascot of Linux.
* The Font Awesome icon font uses the PUA to display various glyphs.
* Powerline, a status line plugin for vim, uses U+E0A0–U+E0A2 and U+E0B0–U+E0B3 for extra box-drawing characters.
* In the Fira Sans typeface used in Firefox OS, U+E003 is displayed as the Mozilla logo (the dinosaur head).
* Lotus Multi-Byte Character Set (LMBCS), the encoding and character set internally used by Lotus/IBM Lotus 1-2-3, Symphony, SmartSuite, Notes and Domino, as well as a number of third-party products such as Microsoft Works, uses some characters (U+F862–U+F89F and U+F8FB–U+F8FE) in the Private Use Area for symbols not defined in Unicode. Of these, U+F8FB is known to be reserved for a crown currency symbol ("Kr"), and U+F8FC and U+F8FD were later mapped to U+FB02 (ﬂ) and U+FB01 (ﬁ) respectively. Additionally, when UTF-16 codes are embedded in LMBCS, the UTF-16 codes corresponding to U+F601 through U+F6FF are substituted for UTF-16 codes which would contain null bytes, since LMBCS is designed not to contain embedded null bytes.
* IBM reserved several code page IDs for PUA code pages: code page 1446 for the generic plane 15, code page 1447 for the generic plane 16, code page 1448 for the generic BMP PUA, code page 1445 (IBM AFP PUA No. 1) for plane 15 with IBM allocations in U+FFF00–U+FFFFD, and code page 1449 (IBM default PUA) for the BMP PUA with IBM allocations in U+F83D–U+F8FF.
* The file system found in Windows uses the U+F000 to U+F0FF block to escape special characters.
* NetApp translates characters in filenames that are allowed on Unix but invalid for SMB clients to PUA characters.
* Twitter's Chirp font provides some additional icons, such as U+E000, which corresponds to a left down arrow, U+EA00, which corresponds to the Twitter bird, and U+F8FF, which corresponds to an Apple logo, possibly for compatibility with Apple fonts.
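
Several of the entries above describe escaping characters that a file system forbids by moving them into the PUA. The following is a minimal sketch of that idea. The forbidden-character set and the fixed offset of 0xF000 are assumptions made for illustration; this article does not specify the exact mapping used by any of the systems listed, and the names FORBIDDEN, PUA_BASE, escape_name and unescape_name are hypothetical.

```python
# Assumed set of characters the target file system rejects in names.
FORBIDDEN = set('<>:"\\|?*')
# Assumed offset: a forbidden character c is stored as U+F000 + ord(c).
PUA_BASE = 0xF000


def escape_name(name: str) -> str:
    """Replace forbidden characters with private-use stand-ins."""
    return "".join(
        chr(PUA_BASE + ord(c)) if c in FORBIDDEN else c for c in name
    )


def unescape_name(name: str) -> str:
    """Reverse escape_name, leaving unrelated private-use characters alone."""
    out = []
    for c in name:
        original = ord(c) - PUA_BASE
        if 0 <= original <= 0xFF and chr(original) in FORBIDDEN:
            out.append(chr(original))
        else:
            out.append(c)
    return "".join(out)


stored = escape_name("notes: draft?.txt")
restored = unescape_name(stored)
```

A scheme like this is only reversible if the original names never contain the stand-in code points themselves, which illustrates the collision risk inherent in private-use assignments.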


Private-use characters in other character sets

The concept of reserving specific code points for Private Use is based on similar earlier usage in other character sets. In particular, many otherwise obsolete characters in East Asian scripts continue to be used in specific names or other situations, and so some character sets for those scripts made allowance for private-use characters (such as the user-defined planes of CNS 11643, or gaiji in certain Japanese encodings). The Unicode standard references these uses under the name "End User Character Definition" (EUCD).

Additionally, the C1 control block contains two codes intended for private use "control functions" by ECMA-48: 0x91 (PU1) and 0x92 (PU2). Unicode includes these at U+0091 and U+0092 but defines them as control characters (category Cc), not private-use characters (category Co).

Encodings which do not have private use areas but have more or less unused areas, such as ISO/IEC 8859 and Shift JIS, have seen uncontrolled variants of these encodings evolve. For Unicode, software companies can use the Private Use Areas for their desired additions.

