A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.
History
During the 1960s, mainframe and mini-computer manufacturers began to standardize around the 8-bit byte as their smallest datatype. The 7-bit ASCII character set became the industry standard method for encoding alphanumeric characters for teletype machines and computer terminals. The extra bit was used for parity, to ensure the integrity of data storage and transmission. As a result, the 8-bit byte became the de facto datatype for computer systems storing ASCII characters in memory.
Later, computer manufacturers began to make use of the spare bit to extend the ASCII character set beyond its limited set of English alphabet characters. 8-bit extensions such as IBM code page 37, PETSCII and ISO 8859 became commonplace, offering terminal support for Greek, Cyrillic, and many others. However, such extensions were still limited in that they were region specific and often could not be used in tandem. Special conversion routines had to be used to convert from one character set to another, often resulting in destructive translation when no equivalent character existed in the target set.
In 1989, the International Organization for Standardization began work on the Universal Character Set (UCS), a multilingual character set that could be encoded using either a 16-bit (2-byte) or 32-bit (4-byte) value. These larger values required a datatype larger than 8 bits to store the new character values in memory. Thus the term wide character was used to differentiate them from traditional 8-bit character datatypes.
Relation to UCS and Unicode
A wide character refers to the size of the datatype in memory. It does not state how each value in a character set is defined. Those values are instead defined using character sets, with UCS and Unicode simply being two common character sets that encode more characters than an 8-bit wide numeric value (256 values in total) would allow.
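As a rough illustration of that limit, the following Python sketch treats code points purely as numbers and checks which ones fit in a single 8-bit unit. The names `EIGHT_BIT_VALUES` and `fits_in_8_bits` are made up for this example; `ord` and `sys.maxunicode` are standard Python.

```python
import sys

# An 8-bit unit can hold 2**8 = 256 distinct values (0 through 255).
EIGHT_BIT_VALUES = 2 ** 8


def fits_in_8_bits(ch: str) -> bool:
    """Return True if the character's code point fits in one 8-bit unit."""
    return ord(ch) < EIGHT_BIT_VALUES


# A Latin letter, an accented letter, the euro sign, and an emoji.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} fits in 8 bits: {fits_in_8_bits(ch)}")

# Highest code point a Python 3.3+ string can hold.
print(hex(sys.maxunicode))
```

The check says nothing about how a character is stored; it only shows that the numeric values assigned by UCS and Unicode quickly exceed what one 8-bit unit can represent.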
Relation to multibyte characters
Just as earlier data transmission systems suffered from the lack of an 8-bit clean data path, modern transmission systems often lack support for 16-bit or 32-bit data paths for character data. This has led to character encoding systems such as UTF-8 that can use multiple bytes to encode a value that is too large for a single 8-bit symbol.
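A minimal Python sketch of this behaviour, using the built-in `str.encode` method, shows how the number of UTF-8 bytes grows with the size of the code point. The helper `utf8_length` and the sample characters are chosen for illustration only.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))


samples = {
    "A": "U+0041, ASCII letter",
    "\u00e9": "U+00E9, Latin small letter e with acute",
    "\u20ac": "U+20AC, euro sign",
    "\U0001F600": "U+1F600, outside the Basic Multilingual Plane",
}

for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{description}: {utf8_length(ch)} byte(s), hex {encoded.hex()}")
```

Every byte in the output is an ordinary 8-bit value, which is why such text can travel over paths that only handle 8-bit symbols.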
The C standard distinguishes ''multibyte'' encodings of characters, which use a fixed or variable number of bytes to represent each character (primarily used in source code and external files), from ''wide characters'', which are run-time representations of characters in single objects (typically greater than 8 bits).
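The two forms can be compared from Python through the standard `ctypes` module, which exposes C `char` arrays and `wchar_t` arrays. This is only a sketch: the sample text is arbitrary and deliberately limited to the Basic Multilingual Plane, so that each character occupies exactly one `wchar_t` even where `wchar_t` is 16 bits wide.

```python
import ctypes

text = "h\u00e9\u20ac"  # three characters, all inside the Basic Multilingual Plane

# Multibyte form: a char array holding the UTF-8 bytes plus a terminating NUL.
narrow = ctypes.create_string_buffer(text.encode("utf-8"))

# Wide form: a wchar_t array with one element per character plus a NUL.
wide = ctypes.create_unicode_buffer(text)

print(len(narrow))  # element count: 1 + 2 + 3 UTF-8 bytes, plus the NUL
print(len(wide))    # element count: 3 wide characters, plus the NUL
print(ctypes.sizeof(wide))  # byte count depends on the platform's wchar_t
```

The narrow buffer uses a variable number of elements per character, while the wide buffer uses one element per character but a platform-dependent number of bytes per element.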
Size of a wide character
Early adoption of UCS-2 ("Unicode 1.0") led to common use of UTF-16 in a number of platforms, most notably Microsoft Windows, .NET and Java. In these systems, it is common to have a "wide character" type (wchar_t in C/C++; char in Java) of 16 bits. These types do not always map directly to one "character", as surrogate pairs are required to store the full range of Unicode (1996, Unicode 2.0). Unix-like systems generally use a 32-bit wchar_t to fit the 21-bit Unicode code point, as C90 prescribed.
The size of a wide character type does not dictate what kind of text encodings a system can process, as conversions are available. (Old conversion code commonly overlooks surrogates, however.) The historical circumstances of their adoption do, however, influence which encodings they ''prefer''. A system influenced by Unicode 1.0, such as Windows, tends to mainly use "wide strings" made out of wide character units. Other systems such as the Unix-likes, however, tend to retain the 8-bit "narrow string" convention, using a multibyte encoding (almost universally UTF-8) to handle "wide" characters.
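The surrogate mechanism mentioned in this section can be sketched in a few lines of Python. The function names `to_surrogate_pair` and `utf16_units` are hypothetical helpers written for this example; the arithmetic follows the UTF-16 definition, and the second helper uses Python's own `utf-16-le` codec as a cross-check.

```python
import struct
from typing import List, Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def utf16_units(text: str) -> List[int]:
    """Return the 16-bit code units produced by Python's UTF-16 encoder."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print([f"{unit:04X}" for unit in utf16_units("a\U0001F600")])
```

Conversion code that treats every 16-bit unit as a complete character would count the second string as three characters rather than two, which is the kind of oversight the paragraph above refers to.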
Programming specifics
C/C++
The C and C++ standard libraries include a number of facilities for dealing with wide characters and strings composed of them. Wide characters are defined using the datatype wchar_t, which in the original C90 standard was defined as
: "an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales" (ISO 9899:1990 §4.1.5)
Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined. The ISO/IEC 10646:2003 Unicode standard 4.0 says that:
:"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."
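One way to observe this platform dependence without writing C is to query the width of `wchar_t` through Python's standard `ctypes` module, which mirrors the C compiler used to build the interpreter. This is a sketch only; the variable names are illustrative.

```python
import ctypes

# Size in bytes of the platform's wchar_t, as seen by ctypes.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
wchar_bits = wchar_bytes * 8

# A Unicode code point needs up to 21 bits, so a single wchar_t can hold
# any code point only if it is at least that wide.
one_unit_holds_any_code_point = wchar_bits >= 21

print(f"wchar_t is {wchar_bits} bits wide on this platform")
print(f"one wchar_t can hold any code point: {one_unit_holds_any_code_point}")
```

The result typically differs between Windows and Unix-like systems, which is why portable code is advised not to assume a particular width.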
Python
According to Python 2.7's documentation, the language sometimes uses wchar_t as the basis for its character type Py_UNICODE. Whether it does depends on whether wchar_t is "compatible with the chosen Python Unicode build variant" on that system. This distinction has been deprecated since Python 3.3, which introduced a flexibly-sized UCS1/2/4 storage for strings and formally aliased Py_UNICODE to wchar_t. Since Python 3.12, the use of wchar_t (that is, the Py_UNICODE typedef) for Python strings (wstr in the implementation) has been dropped; as before, a "UTF-8 representation is created on demand and cached in the Unicode object."
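The flexibly-sized storage can be observed indirectly from Python code. The sketch below assumes CPython 3.3 or later, where `sys.getsizeof` reports the size of the string object itself; other Python implementations may store strings differently. The helper `bytes_per_character` is a hypothetical name used only here.

```python
import sys


def bytes_per_character(ch: str) -> int:
    """Estimate per-character storage by growing a string by one character."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


for label, ch in [
    ("ASCII", "A"),
    ("Latin-1", "\u00e9"),
    ("Basic Multilingual Plane", "\u20ac"),
    ("beyond the Basic Multilingual Plane", "\U0001F600"),
]:
    print(f"{label}: {bytes_per_character(ch)} byte(s) per character")
```

The widest character in a string determines the storage width for the whole string, so a single character outside the Basic Multilingual Plane moves the entire string to 4 bytes per character.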
External links
The Unicode Standard, Version 4.0 - online edition
Multibyte (3) Man Page @ FreeBSD.org
Multibyte and Wide Characters @ Microsoft Developer Network
Windows Character Sets @ Microsoft Developer Network
Unicode and Character Set Programming Reference @ Microsoft Developer Network
Keep multibyte character support simple @ EuroBSDCon, Beograd, September 25, 2016