Unicode input is a method of adding a specific Unicode character to a computer file; it is a common way to input characters not directly supported by a physical keyboard. Characters can be entered by selecting them from a display, by typing a certain sequence of keys on a physical keyboard, or by drawing the symbol by hand on a touch-sensitive screen. In contrast to ASCII's 96-element character set (which it contains), Unicode encodes well over a hundred thousand graphemes (characters) from almost all of the world's written languages, along with many other signs and symbols. A Unicode input system must provide for a large repertoire of characters, ideally all valid Unicode code points. This is different from a keyboard layout, which defines keys and their combinations only for a limited number of characters appropriate for a certain locale.


Unicode numbers

Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310. Characters in the Basic Multilingual Plane (BMP), which contains modern scripts (including many Chinese and Japanese characters) and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons, emojis, playing cards and many CJK characters), have 5-digit codes.
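
As a rough illustration of the notation, the following Python sketch formats a character's code point in the conventional "U+" form and reports whether it lies in the BMP. The function names are invented for this example.

```python
def u_notation(char: str) -> str:
    """Return the conventional U+ notation for a single character."""
    # At least four upper-case hex digits; longer codes are not padded further.
    return f"U+{ord(char):04X}"


def in_bmp(char: str) -> bool:
    """True if the code point is in plane 0 (U+0000 to U+FFFF)."""
    return ord(char) <= 0xFFFF


for ch in ("\u00ae", "\U0001d310"):
    print(u_notation(ch), "BMP" if in_bmp(ch) else "supplementary plane")
```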


Glyph availability

An application can display a character only if it can access a computer font which contains a glyph for that character (Andrew Marcuse, "How to enter Unicode characters in Microsoft Windows", accessed September 13, 2012). Fonts usually have incomplete Unicode coverage; most contain only the glyphs needed to support a few writing systems. However, most modern browsers and other text-processing applications are able to display multilingual content because they perform font substitution, automatically switching to a fallback font when necessary to display characters which are not supported in the current font. Which fonts are used for fallback, and how thorough the Unicode coverage is, vary by software and operating system; some software will search for a suitable glyph in all of the installed fonts, while other software searches only within certain fonts. If an application does not have access to a glyph for a required code point in the specified font, the character should be shown as the font's missing-character (.notdef) glyph. This often appears as an empty box, ☐ (nicknamed "tofu" based on the shape), a box with an X in it, ☒, a diamond with a question mark, �, or a box with a question mark in it, ⍰.


Techniques


Extended keyboard mapping

Most operating systems support extended keyboard mapping, the facility to increase the repertoire of characters available using techniques such as Alternate graphic ("AltGr"), which gives a third and fourth meaning to every key; the Compose key (sometimes called multi key), a key on a computer keyboard that indicates that the following (usually two or more) keystrokes trigger the insertion of an alternate character, typically a precomposed character or a symbol; dead keys, typically used to attach a specific diacritic to a base letter; or combinations of these. These techniques facilitate entry of character sets beyond the basic set provided as standard with the computer.
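
The effect of a dead key can be sketched in Python as follows. This is only a model under stated assumptions: the `DEAD_KEYS` table and the `dead_key` function are hypothetical, and real keyboard drivers use their own per-layout tables rather than Unicode normalization. The sketch appends a combining diacritic to the base letter and lets NFC normalization select the precomposed character where one exists.

```python
import unicodedata

# Hypothetical dead-key table: key pressed -> combining diacritic it attaches.
DEAD_KEYS = {
    "'": "\u0301",  # combining acute accent
    "`": "\u0300",  # combining grave accent
    "^": "\u0302",  # combining circumflex accent
    "~": "\u0303",  # combining tilde
    '"': "\u0308",  # combining diaeresis
}


def dead_key(accent_key: str, base: str) -> str:
    """Combine a dead-key press with the base letter typed after it."""
    combined = base + DEAD_KEYS[accent_key]
    # NFC normalization yields the precomposed character where one exists.
    return unicodedata.normalize("NFC", combined)


print(dead_key("'", "e"))
print(dead_key("~", "n"))
```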


Selection from a screen

Many systems provide a way to select Unicode characters visually. ISO/IEC 14755 refers to this as a ''screen-selection entry method''. Microsoft Windows has provided a Unicode version of the Character Map program, appearing in the consumer edition since XP. This is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block. Starting with Windows 10, Microsoft Windows also contains a so-called "emoji keyboard". It can be opened by holding down the Windows key and pressing the period or semicolon key. The emoji keyboard allows entry of emojis as well as symbols. More advanced third-party tools of the same type are also available (a notable freeware example is BabelMap, which supports all Unicode characters). On most Linux desktop environments, equivalent tools, such as gucharmap (GNOME) or kcharselect (KDE), are available. Generally these tools let the user copy the selected characters to the clipboard and then paste them into the document, rather than emulating direct typing. It is often practical simply to find the desired character on the web or in another document, and copy and paste it from there.
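
Searching by character name, as such tools offer, can be approximated with Python's standard `unicodedata` module. The sketch below is not how any of the named tools is implemented; the `find_by_name` helper is invented for illustration and scans only the BMP by default.

```python
import unicodedata


def find_by_name(fragment: str, limit: int = 0x10000) -> list[str]:
    """Return characters below `limit` whose Unicode name contains `fragment`."""
    fragment = fragment.upper()
    matches = []
    for cp in range(limit):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            matches.append(chr(cp))
    return matches


print(find_by_name("copyright"))
# An exact name can be looked up directly.
print(unicodedata.lookup("EURO SIGN"))
```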


Decimal input (Alt codes)

Some programs running in Microsoft Windows, including recent versions of Word and Notepad, can produce characters from their Unicode code points expressed in decimal and entered on the numeric keypad with the Alt key held down. For example, the Euro sign has 20AC as its hexadecimal code point, which is 8364 in decimal, so Alt+8364 will produce the € symbol. Decimal code points in the range 160–255 must be entered with a leading zero (so that the Windows code page is chosen) and furthermore the Windows code page CP1252 must be in use. For example, Alt+0247 yields a ÷, corresponding to its code point, but the character produced by Alt+247 depends on the OEM code page, such as Code page 437, and may yield a ≈. Also, Alt+0128 through Alt+0159 yield the characters assigned in rows 8 and 9 of the CP1252 layout, rather than the C1 control codes that are assigned to those numbers in Unicode. In programs which were not designed to handle Alt codes over 255, the character retrieved usually corresponds to the remainder when the number is divided by 256.

The text editor Vim allows characters to be specified by two-character mnemonics referred to as digraphs. The installed set can be augmented by custom mnemonics defined for arbitrary code points, specified in decimal. For example, as decimal 9881 is equal to hexadecimal 2699, :digraph Gr 9881 associates "Gr" with ⚙ (U+2699).

See the HTML section below for use of decimal code points in HTML.
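
The Alt code rules above can be modelled in a few lines of Python. This is a simplified sketch, not a description of Windows internals: the `alt_code` function and its parameters are hypothetical, code page 437 is only an assumed default for the OEM code page, and codes below 32 (which Alt codes map to graphic symbols) are not modelled faithfully.

```python
def alt_code(digits: str, oem_page: str = "cp437", unicode_aware: bool = True) -> str:
    """Model the character produced by holding Alt and typing `digits`."""
    number = int(digits)
    if number > 255:
        if unicode_aware:
            return chr(number)  # treated as a decimal Unicode code point
        number %= 256           # legacy programs keep only the remainder
    # A leading zero selects the Windows code page, otherwise the OEM code page.
    page = "cp1252" if digits.startswith("0") else oem_page
    # Bytes that the code page leaves undefined decode to U+FFFD.
    return bytes([number]).decode(page, errors="replace")


for code in ("8364", "0247", "247", "0128"):
    print(f"Alt+{code} ->", alt_code(code))
```

Passing `unicode_aware=False` imitates a program that was not designed for codes over 255 and therefore falls back to the remainder modulo 256.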


Hexadecimal input

Clause 5.1 of ISO/IEC 14755 describes a ''Basic method'' whereby a ''beginning sequence'' is followed by the hexadecimal representation of the code point and the ''ending sequence''. Most modern systems have some method to emulate this, sometimes limited to four digits (thus only the Basic Multilingual Plane).


In Microsoft Windows

Hexadecimal Unicode input can be enabled by adding a string type (REG_SZ) value called EnableHexNumpad to the registry key HKEY_CURRENT_USER\Control Panel\Input Method and assigning the value data 1 to it. Users will need to log off and back in after editing the registry for this input method to start working. (In versions earlier than Windows Vista, users needed to reboot for it to start working.) Unicode characters can then be entered by holding down Alt, typing + on the numeric keypad followed by the hexadecimal code, and then releasing Alt. This may not work for 5-digit hexadecimal codes. Some versions of Windows may require the digits 0–9 to be typed on the numeric keypad or require NumLock to be on.

In some applications (Word, Notepad and LibreOffice programs), Alt+X will replace the hexadecimal number to the left of the cursor with the matching Unicode character. Unless it is six hexadecimal digits long, the code must not be preceded by any digit or the letters a–f, as they may be treated as part of the code to be converted. For example, entering af1 followed by Alt+X (or Alt+C if using a French version) will produce '૱' (U+0AF1), but entering a0000f1 followed by Alt+X will produce 'añ' ('a' followed by character U+00F1). This facility enables Unicode characters to be entered in other applications: one can create a desired character in Notepad, for example, and then cut and paste it wherever desired.
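
The in-place conversion of a hexadecimal number to the left of the cursor can be sketched as below. This is a deliberately simplified model, not the actual algorithm of Word, Notepad or LibreOffice: the `alt_x` function is hypothetical, it treats the end of the string as the cursor position, and it always consumes up to six trailing hexadecimal digits.

```python
import re

MAX_CODE_POINT = 0x10FFFF


def alt_x(text: str) -> str:
    """Replace the hex number at the end of `text` with its character."""
    # Up to six hex digits immediately before the cursor (end of string).
    match = re.search(r"[0-9A-Fa-f]{1,6}$", text)
    if match is None:
        return text  # nothing to convert
    value = int(match.group(), 16)
    if value > MAX_CODE_POINT or 0xD800 <= value <= 0xDFFF:
        return text  # simplification: leave invalid codes untouched
    return text[:match.start()] + chr(value)


print(alt_x("af1"))      # three digits, read as U+0AF1
print(alt_x("a0000f1"))  # last six digits, read as U+00F1, leaving the "a"
```

The second call shows why a code shorter than six digits must not directly follow another hexadecimal digit: the preceding digit would be absorbed into the number.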


In macOS

Hex input of Unicode must be enabled. In Mac OS 8.5 and later, one can choose the ''Unicode Hex Input'' keyboard layout; in OS X (10.10) Yosemite, this can be added in Keyboard → Input Sources. Holding down the Option key, one types the four-digit hexadecimal Unicode code point and the equivalent character appears; one can then release the key (see "Typing special and accented characters"). Characters outside the Basic Multilingual Plane (BMP) exceed the four-digit limit of the Unicode hex input mechanism but can be entered by using surrogate pairs: holding down the Option key while entering the first surrogate and then the second surrogate, then releasing the Option key.
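
The surrogate pair for a supplementary-plane character follows from the standard UTF-16 arithmetic, which the Python sketch below reproduces. The `surrogate_pair` helper is an illustrative name, not part of any system API.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1D310)
print(f"{high:04X} {low:04X}")  # the two four-digit codes to type
```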


In X11 (Linux and other Unix variants including ChromeOS)

In many applications one or both of the following methods work to directly input Unicode characters:

* Holding Ctrl+Shift and typing u followed by the hex digits, then releasing Ctrl+Shift.
* Entering Ctrl+Shift+U, releasing, then typing the hex digits and pressing Enter (or Space or even, on some systems, pressing and releasing Shift or Ctrl).

This is supported by GTK and Qt applications, and possibly others. In ChromeOS, this is an operating system function.


In platform-independent applications

* In Emacs, C-x 8 RET invokes the insert-char command, which accepts input either as a hexadecimal code point or as a Unicode character name.
* In LibreOffice 5.1 onwards, the method described above for Windows works.
* In Opera versions that use the Presto layout engine (i.e. up to and including version 12.xx), entering the hexadecimal number of the desired symbol or character and then pressing Alt+X converts it (an alternative shortcut applies on macOS).
* In the Vim editor, in insert mode, the user first types Ctrl+V u (for code points up to 4 hex digits long; using Ctrl+V U for longer), then types in the hexadecimal number of the symbol or character desired, and it will be converted into the symbol. (On Microsoft Windows, Ctrl+Q may be required instead of Ctrl+V; see Vim documentation: gui_w32.)
* In AutoCAD, \U+nnnn or three shortcuts: %%c, %%d, %%p.


HTML

In HTML and XML, character codes to be rendered as characters are prefixed by an ampersand and a number sign (&#), and are followed by a semicolon (;). The code point can be either in decimal or in hexadecimal; in the latter case it is preceded by an "x". Leading zeros may be omitted. A number of characters may be represented by a named entity.

''Example:'' In HTML/XML, the copyright sign © (U+00A9) may be coded as:

* &#169; (decimal code point)
* &#xA9; (hexadecimal code point)
* &copy; (entity name; in XML this works only if the entity is declared)

This works in many pieces of software that accept HTML markup, such as Thunderbird and Wikipedia editing.
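
Numeric character references are straightforward to generate and to decode. The Python sketch below builds both forms for a character and decodes them with the standard library's `html.unescape`; the `numeric_references` helper is a name made up for this example.

```python
import html


def numeric_references(char: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal character references for `char`."""
    cp = ord(char)
    return f"&#{cp};", f"&#x{cp:X};"


decimal_ref, hex_ref = numeric_references("\u00a9")
print(decimal_ref, hex_ref)

# All three spellings decode to the same character.
for markup in (decimal_ref, hex_ref, "&copy;"):
    print(markup, "->", html.unescape(markup))
```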


See also

* ASCII
* Digraphs and trigraphs (programming)

