
Unicode input is a method of adding a specific Unicode character to a computer file; it is a common way to input characters not directly supported by a physical keyboard. Characters can be entered by selecting them from a display, by typing a certain sequence of keys on a physical keyboard, or by drawing the symbol by hand on a touch-sensitive screen. In contrast to ASCII's 128-character set (which it contains), Unicode encodes well over a hundred thousand graphemes (characters) from almost all of the world's written languages, along with many other signs and symbols.
A Unicode input system must provide for a large repertoire of characters, ideally all valid Unicode code points. This is different from a keyboard layout, which defines keys and their combinations only for a limited number of characters appropriate for a certain locale.
Unicode numbers
Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310. Characters in the Basic Multilingual Plane (BMP), containing modern scripts – including many Chinese and Japanese characters – and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons, emoji, playing cards and many CJK characters), have 5-digit codes.
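As a small illustration of the "U+" notation and the BMP boundary, the following Python sketch formats a character's code point and reports whether it falls in the BMP. The helper names `code_point_label` and `in_bmp` are made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


def in_bmp(ch: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane (U+0000..U+FFFF)."""
    return ord(ch) <= 0xFFFF


# The two example code points mentioned in the text.
for ch in ("\u00AE", "\U0001D310"):
    plane = "BMP" if in_bmp(ch) else "supplementary plane"
    print(code_point_label(ch), plane)
```

Characters in the BMP print with four digits, and anything beyond U+FFFF needs five or six.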
Glyph availability
An application can display a character only if it can access a computer font which contains a glyph for that character (Andrew Marcuse, "How to enter Unicode characters in Microsoft Windows", accessed September 13, 2012). Fonts usually have incomplete Unicode coverage; most contain only the glyphs needed to support a few writing systems. However, most modern browsers and other text-processing applications are able to display multilingual content because they perform font substitution, automatically switching to a fallback font when necessary to display characters which are not supported in the current font. Which fonts are used for fallback, and how thorough their Unicode coverage is, varies by software and operating system; some software will search for a suitable glyph in all of the installed fonts, while other software searches only within certain fonts.
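The fallback idea can be sketched in a few lines of Python. This is only a toy model: the font names and coverage sets in `FONTS` are invented for illustration, and real systems consult the character maps inside font files rather than hand-written sets.

```python
from typing import Optional

# Hypothetical coverage tables: font name -> set of supported code points.
FONTS = {
    "LatinOnly": set(range(0x20, 0x7F)),
    "GreekToo": set(range(0x20, 0x7F)) | set(range(0x370, 0x400)),
}


def pick_font(ch: str, preferred: str, fallbacks: list[str]) -> Optional[str]:
    """Return the first font in the chain that covers ch, or None if none does."""
    for name in [preferred, *fallbacks]:
        if ord(ch) in FONTS.get(name, set()):
            return name
    return None  # a renderer would draw a placeholder glyph in this case


print(pick_font("A", "LatinOnly", ["GreekToo"]))
print(pick_font("\u03A9", "LatinOnly", ["GreekToo"]))
print(pick_font("\u4E2D", "LatinOnly", ["GreekToo"]))
```

How long the fallback chain is, and in what order it is searched, is exactly what varies between applications and operating systems.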
If an application does not have access to a glyph for a required code point in the specified font, the character should be shown as the font's .notdef glyph. This often appears as an empty box, ☐ (nicknamed "tofu" based on the shape), a box with an X in it, ☒, a diamond with a question mark, �, or a box with a question mark in it, ⍰.
Techniques
Extended keyboard mapping
Most operating systems support extended keyboard mapping, the facility to increase the repertoire of characters available using techniques such as Alternate graphic ("AltGr"), which gives a third and fourth meaning to every key; the Compose key (sometimes called multi key), a key on a computer keyboard that indicates that the following (usually two or more) keystrokes trigger the insertion of an alternate character, typically a precomposed character or a symbol; dead keys, typically used to attach a specific diacritic to a base letter; or combinations of these.
These techniques facilitate entry of character sets beyond the basic set provided as standard with the computer.
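The relationship between a base letter plus diacritic and a precomposed character can be shown with Python's standard `unicodedata` module. This is only an analogy for what a dead key or Compose sequence produces: actual input methods use their own lookup tables and are not assumed here to rely on Unicode normalization.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points.
base_plus_accent = "e\u0301"

# NFC normalization yields the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", base_plus_accent)

# NFD goes the other way, splitting U+00E9 into base letter and accent.
decomposed = unicodedata.normalize("NFD", "\u00E9")

print(len(base_plus_accent), len(composed), len(decomposed))
```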
Selection from a screen
Many systems provide a way to select Unicode characters visually.
ISO/IEC 14755 refers to this as a ''screen-selection entry method''.
Microsoft Windows has provided a Unicode version of the Character Map program, appearing in the consumer edition since XP. This is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block. Starting with Windows 10, Windows also contains a so-called "emoji keyboard". It can be opened by holding down the Windows key and pressing the period or semicolon key. The emoji keyboard allows entry of emoji as well as symbols.
More advanced third-party tools of the same type are also available (a notable freeware example is BabelMap, which supports all Unicode characters). On most Linux desktop environments, equivalent tools – such as gucharmap (GNOME) or kcharselect (KDE) – are available.
Generally these tools let the user "copy" the selected characters into the clipboard, and then paste them into the document, rather than inserting them as if they had been typed.
It is often practical to just find the desired character on the web or in another document, and copy and paste it from there.
Decimal input (Alt codes)
Some programs running in Microsoft Windows, including recent versions of Word and Notepad, can produce characters from their Unicode code points expressed in decimal and entered on the numeric keypad with the Alt key held down. For example, the Euro sign has 20AC as its hexadecimal code point, which is 8364 in decimal, so Alt+8364 will produce the symbol.
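The hexadecimal-to-decimal conversion needed for this method is easy to check in Python. The helper name `alt_decimal` is made up for this sketch; it only computes the number to type and does not simulate any keyboard input.

```python
def alt_decimal(ch: str) -> str:
    """Return the decimal code point of ch, as it would be typed on the keypad."""
    return str(ord(ch))


euro = "\u20AC"
print(f"U+{ord(euro):04X} is {alt_decimal(euro)} in decimal")

# The conversion also works from the hexadecimal string directly.
print(int("20AC", 16))
```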
Decimal code points in the range 160–255 must be entered with a leading zero (so that the Windows code page is chosen), and furthermore the Windows code page must be CP1252. For example, Alt+0163 yields a £, corresponding to its code point, but the character produced by Alt+163 depends on the OEM code page, such as Code page 437, and may yield a ú. Also, Alt+0128 through Alt+0159 yield the characters assigned in rows 8 and 9 in the CP1252 layout, rather than the C1 control codes that are assigned to those numbers in Unicode.
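The effect of the code page on values below 256 can be reproduced with Python's built-in `cp1252` and `cp437` codecs. This is a rough model rather than a description of how Windows implements Alt codes; the function names are invented, and the OEM code page is assumed to be 437.

```python
import unicodedata


def with_leading_zero(n: int) -> str:
    """Model a keypad value entered with a leading zero: interpret it in CP1252."""
    return bytes([n]).decode("cp1252")


def without_leading_zero(n: int, oem: str = "cp437") -> str:
    """Model a value of 128 or above entered without a leading zero:
    interpret it in the OEM code page."""
    return bytes([n]).decode(oem)


print(with_leading_zero(163), without_leading_zero(163))

# 128 is a C1 control code in Unicode, but a printable character in CP1252.
print(unicodedata.category(chr(128)), with_leading_zero(128))
```

A few positions in rows 8 and 9 of CP1252 are unassigned (for example 129), and Python's codec raises `UnicodeDecodeError` for those.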
In programs which were not designed to handle Alt codes over 255, the character retrieved usually corresponds to the remainder when the number is divided by 256.
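The wrap-around is plain modular arithmetic, sketched below with a hypothetical helper `wrapped_alt_code`. Which character the resulting value then maps to still depends on the code page in use.

```python
def wrapped_alt_code(n: int) -> int:
    """Value actually used by a program that keeps only the low 8 bits."""
    return n % 256


# 8364 (the Euro sign) is 0x20AC; only the low byte 0xAC survives.
print(wrapped_alt_code(8364))  # 172
```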
The text editor Vim allows characters to be specified by two-character mnemonics referred to as digraphs. The installed set can be augmented by custom mnemonics defined for arbitrary code points, specified in decimal. For example, as decimal 9881 is equal to hexadecimal 2699, :digraph Gr 9881 associates "Gr" with ⚙ (U+2699).
See the HTML section below for use of decimal code points in HTML.
Hexadecimal input
Clause 5.1 of ISO/IEC 14755 describes a ''Basic method'' whereby a ''beginning sequence'' is followed by the hexadecimal representation of the code point and the ''ending sequence''. Most modern systems have some method to emulate this, sometimes limited to four digits (thus only the Basic Multilingual Plane).
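What happens between the beginning and ending sequences can be sketched as a small validation step. The function below is a hypothetical illustration, not taken from ISO/IEC 14755: its name, the `max_digits` parameter and the decision to reject surrogate values are assumptions of this example.

```python
import string


def decode_hex_entry(digits: str, max_digits: int = 6) -> str:
    """Turn the hex digits typed between the two sequences into a character."""
    if not digits or len(digits) > max_digits:
        raise ValueError("expected between 1 and max_digits hex digits")
    if any(c not in string.hexdigits for c in digits):
        raise ValueError("non-hexadecimal digit in input")
    value = int(digits, 16)
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    return chr(value)


print(decode_hex_entry("20AC"))
print(decode_hex_entry("1D310"))
```

Calling it with `max_digits=4` models a system that can only reach the Basic Multilingual Plane: a five-digit entry such as `1D310` is then rejected.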
In Microsoft Windows
Hexadecimal Unicode input can be enabled by adding a string type (REG_SZ) value called EnableHexNumpad to the registry key HKEY_CURRENT_USER\Control Panel\Input Method and assigning the value data 1 to it. Users will need to log off and back in after editing the registry for this input method to start working. (In versions earlier than Windows Vista, users needed to reboot for it to start working.) Unicode characters can then be entered by holding down Alt, typing + on the numeric keypad followed by the hexadecimal code, and then releasing Alt.
This may not work for 5-digit hexadecimal codes. Some versions of Windows may require the digits 0–9 to be typed on the numeric keypad, or require NumLock to be on.
In some applications (Word, Notepad and LibreOffice programs), Alt+X will replace the hexadecimal number to the left of the cursor with the matching Unicode character. Unless it is six hexadecimal digits long, the code must not be preceded by any digit or letter a–f, as these may be treated as part of the code to be converted. For example, entering af1 followed by Alt+X (or Alt+C if using a French version) will produce '૱' (U+0AF1), but entering a0000f1 followed by Alt+X will produce 'añ' ('a' followed by character U+00F1).
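The two examples above (af1 and a0000f1) are consistent with a simple rule: take up to six hexadecimal digits immediately to the left of the cursor and convert them. The Python sketch below implements that simplified rule. It is a model for illustration only, not the actual algorithm of any of the named applications, and the function name is hypothetical.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def convert_hex_before_cursor(text: str) -> str:
    """Replace up to six trailing hex digits of text with the matching character."""
    start = len(text)
    # Walk left over hex digits, stopping after six of them.
    while start > 0 and len(text) - start < 6 and text[start - 1] in HEX_DIGITS:
        start -= 1
    digits = text[start:]
    if not digits:
        return text  # nothing to convert
    value = int(digits, 16)
    if value > 0x10FFFF:
        return text  # simplification: leave out-of-range input untouched
    return text[:start] + chr(value)


print(convert_hex_before_cursor("af1"))      # all three characters are hex digits
print(convert_hex_before_cursor("a0000f1"))  # only the last six are consumed
```

Padding the code to six digits, as in the second call, is what protects a preceding letter such as "a" from being absorbed into the number.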
This facility enables Unicode characters to be entered in other applications: one can create a desired character in Notepad, for example, and then cut and paste it wherever desired.
In macOS
Hex input of Unicode must be enabled. In Mac OS 8.5 and later, one can choose the ''Unicode Hex Input'' keyboard layout; in OS X (10.10) Yosemite, this can be added in Keyboard → Input Sources.
Holding down Option, one types the four-digit hexadecimal Unicode code point and the equivalent character appears; one can then release the key (see "Typing special and accented characters").
Characters outside of the BMP (the Basic Multilingual Plane) exceed the four-digit limit of the Unicode hex input mechanism but can be entered by using surrogate pairs: holding down the Option key while entering the first surrogate followed by the second surrogate, then releasing the Option key.
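The surrogate pair for a given code point follows from the standard UTF-16 arithmetic, sketched here in Python. The function name `surrogate_pair` is made up for the example; the calculation itself is the one defined for UTF-16.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP have surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1D310)
print(f"{high:04X} {low:04X}")  # D834 DF10
```

For U+1D310 the two four-digit values to type are therefore D834 and DF10.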
In X11 (Linux and other Unix variants including ChromeOS)
In many applications one or both of the following methods work to directly input Unicode characters:
* Holding Ctrl+Shift and typing U followed by the hex digits, then releasing Ctrl+Shift.
* Entering Ctrl+Shift+U, releasing, then typing the hex digits and pressing Enter (or Space or even, on some systems, pressing and releasing Shift or Ctrl).
This is supported by GTK and Qt applications, and possibly others. In ChromeOS, this is an operating system function.
In platform-independent applications
* In Emacs, C-x 8 RET invokes the insert-char command, which accepts input either as a hex code point or as a Unicode character name.
* In LibreOffice 5.1 onwards, the Alt+X method described above for Windows works.
* In Opera versions that use the Presto layout engine (i.e. up to and including version 12.xx), entering the hexadecimal number of the desired symbol or character and then pressing the conversion shortcut inserts the character (an alternative shortcut applies on macOS).
* In the Vim editor, in insert mode, the user first types Ctrl+V u (for code points up to 4 hex digits long; using Ctrl+V U for longer), then types in the hexadecimal number of the symbol or character desired, and it will be converted into the symbol. (On Microsoft Windows, Ctrl+Q may be required instead of Ctrl+V; see Vim documentation: gui_w32.)
* In AutoCAD, \U+nnnn or three shortcuts %%c, %%d, %%p.
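Accepting either a hex code point or a Unicode character name, as the Emacs command above does, can be imitated with Python's standard `unicodedata` module. This is a loose analogy and not how Emacs is implemented; the function name `resolve_character` and the choice to try the name first are assumptions of this sketch.

```python
import unicodedata


def resolve_character(text: str) -> str:
    """Interpret text as a Unicode character name, else as a hex code point."""
    try:
        return unicodedata.lookup(text)
    except KeyError:
        # Not a known character name: fall back to hexadecimal.
        return chr(int(text, 16))


print(resolve_character("COPYRIGHT SIGN"))
print(resolve_character("A9"))
print(unicodedata.name("\u2699"))
```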
HTML
In HTML and XML, character codes to be rendered as characters are prefixed by an ampersand and number sign (&#), and are followed by a semicolon (;). The code point can be either in decimal or in hexadecimal; in the latter case it is preceded by an "x". Leading zeros may be omitted. A number of characters may also be represented by a named entity.
''Example:'' In HTML/XML, the copyright sign © (U+00A9) may be coded as:
* &#169; (decimal code point)
* &#xA9; (hexadecimal code point)
* &copy; (entity name)
This works in many pieces of software that accept HTML markup, such as Thunderbird and Wikipedia editing.
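The three forms can be generated and decoded with Python's standard `html` module, as in the sketch below. The helper `character_references` is invented for this example; `html.unescape` and `html.entities.codepoint2name` are part of the standard library.

```python
import html
from html.entities import codepoint2name


def character_references(ch: str) -> list[str]:
    """Return the decimal, hexadecimal and (if one exists) named reference for ch."""
    cp = ord(ch)
    refs = [f"&#{cp};", f"&#x{cp:X};"]
    name = codepoint2name.get(cp)
    if name is not None:
        refs.append(f"&{name};")
    return refs


refs = character_references("\u00A9")
print(refs)

# Every form decodes back to the same character.
print([html.unescape(r) for r in refs])
```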
See also
* ASCII
* Digraphs and trigraphs (programming)