
A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.


History

During the 1960s, mainframe and minicomputer manufacturers began to standardize around the 8-bit byte as their smallest datatype. The 7-bit ASCII character set became the industry standard method for encoding alphanumeric characters for teletype machines and computer terminals. The extra bit was used for parity, to ensure the integrity of data storage and transmission. As a result, the 8-bit byte became the de facto datatype for computer systems storing ASCII characters in memory.

Later, computer manufacturers began to make use of the spare bit to extend the ASCII character set beyond its limited set of English alphabet characters. 8-bit extensions such as IBM code page 37, PETSCII and ISO 8859 became commonplace, offering terminal support for Greek, Cyrillic, and many others. However, such extensions were still limited in that they were region-specific and often could not be used in tandem. Special conversion routines had to be used to convert from one character set to another, often resulting in destructive translation when no equivalent character existed in the target set.

In 1989, the International Organization for Standardization began work on the Universal Character Set (UCS), a multilingual character set that could be encoded using either a 16-bit (2-byte) or 32-bit (4-byte) value. These larger values required the use of a datatype larger than 8 bits to store the new character values in memory. Thus the term wide character was used to differentiate them from traditional 8-bit character datatypes.


Relation to UCS and Unicode

A wide character refers to the size of the datatype in memory. It does not state how each value in a character set is defined. Those values are instead defined by character sets, with UCS and Unicode simply being two common character sets that encode more characters than an 8-bit wide numeric value (256 distinct values in total) would allow.


Relation to multibyte characters

Just as earlier data transmission systems suffered from the lack of an 8-bit clean data path, modern transmission systems often lack support for 16-bit or 32-bit data paths for character data. This has led to character encoding systems such as UTF-8 that can use multiple bytes to encode a value that is too large for a single 8-bit symbol. The C standard distinguishes ''multibyte'' encodings of characters, which use a fixed or variable number of bytes to represent each character (primarily used in source code and external files), from ''wide characters'', which are run-time representations of characters in single objects (typically greater than 8 bits).
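The multibyte idea can be illustrated with a minimal sketch of UTF-8 encoding in C. The function name `utf8_encode` is a hypothetical helper invented for this example, not part of the C standard library; it takes a single code point (the kind of value a wide character would hold) and writes the one to four bytes that represent it in UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a standard library function): encode one
 * Unicode code point as UTF-8. Writes up to 4 bytes into out and returns
 * the number of bytes written, or 0 if cp is a surrogate or lies above
 * U+10FFFF and therefore cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 7 bits: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* up to 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {   /* surrogates are not characters */
        return 0;
    }
    if (cp < 0x10000) {                   /* up to 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* up to 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, calling `utf8_encode(0x20AC, buf)` for the euro sign returns 3 and fills `buf` with the bytes E2 82 AC, whereas the same character stored as a wide character occupies a single 16-bit or 32-bit object.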


Size of a wide character

Early adoption of UCS-2 ("Unicode 1.0") led to common use of UTF-16 in a number of platforms, most notably Microsoft Windows, .NET and Java. In these systems, it is common to have a "wide character" (wchar_t in C/C++; char in Java) type of 16 bits. These types do not always map directly to one "character", as surrogate pairs are required to store the full range of Unicode (1996, Unicode 2.0).

Unix-like systems generally use a 32-bit wchar_t to fit the 21-bit Unicode code point, as C90 prescribed. The size of a wide character type does not dictate what kind of text encodings a system can process, as conversions are available. (Old conversion code commonly overlooks surrogates, however.)

The historical circumstances of their adoption do also decide what types of encoding they ''prefer''. A system influenced by Unicode 1.0, such as Windows, tends to mainly use "wide strings" made out of wide character units. Other systems such as the Unix-likes, however, tend to retain the 8-bit "narrow string" convention, using a multibyte encoding (almost universally UTF-8) to handle "wide" characters.
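The reason a 16-bit unit does not always correspond to one character can be sketched in C. The function `utf16_encode` below is a hypothetical helper written for this illustration; it uses `uint16_t` as a stand-in for a 16-bit wide character unit, on the assumption that the reader's platform provides the fixed-width types from `<stdint.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a standard library function): encode one
 * Unicode code point as UTF-16. Writes one or two 16-bit units into out
 * and returns how many were written, or 0 if cp is itself a surrogate
 * or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                         /* reserved for surrogate pairs */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* fits in a single 16-bit unit */
        return 1;
    }
    if (cp > 0x10FFFF) {
        return 0;                         /* beyond the Unicode range */
    }

    /* Split the remaining 20 bits across a high and a low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low (trailing) surrogate */
    return 2;
}
```

With this sketch, U+20AC needs one unit (0x20AC), while U+1F600 needs the pair 0xD83D 0xDE00. Code that assumes one unit per character would miscount the second case, which is the oversight attributed above to old conversion code.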


Programming specifics


C/C++

The C and C++ standard libraries include a number of facilities for dealing with wide characters and strings composed of them. The wide characters are defined using datatype wchar_t, which in the original C90 standard was defined as "an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales" (ISO 9899:1990 §4.1.5).

Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined. The ISO/IEC 10646:2003 Unicode standard 4.0 says that: "The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."
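Because the width of wchar_t is implementation-defined, a program can query it rather than assume it. The following is a minimal sketch; the helper names `wchar_bits` and `wchar_holds_any_code_point` are hypothetical and exist only for this example, while `CHAR_BIT` and `WCHAR_MAX` are standard macros from `<limits.h>` and `<wchar.h>`.

```c
#include <assert.h>
#include <limits.h>
#include <wchar.h>

/* wchar_t may be as small as 8 bits but never smaller, since every C
 * object occupies at least one byte of at least 8 bits. */
static_assert(sizeof(wchar_t) * CHAR_BIT >= 8,
              "wchar_t is at least 8 bits wide");

/* Hypothetical helper: storage width of wchar_t in bits on this compiler. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: returns 1 if a single wchar_t can hold every
 * Unicode code point (up to U+10FFFF), and 0 otherwise. */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

On a typical Unix-like system `wchar_bits()` is expected to report 32 and on Windows 16, but neither value is guaranteed by the language, which is why portable code that needs a known width uses char16_t or char32_t instead.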


Python

According to Python 2.7's documentation, the language sometimes uses wchar_t as the basis for its character type Py_UNICODE. Whether it does depends on whether wchar_t is "compatible with the chosen Python Unicode build variant" on that system. This distinction has been deprecated since Python 3.3, which introduced a flexibly-sized UCS1/2/4 storage for strings and formally aliased Py_UNICODE to wchar_t. Since Python 3.12, use of wchar_t (i.e. the Py_UNICODE typedef) for Python strings (wstr in the implementation) has been dropped and, as before, a "UTF-8 representation is created on demand and cached in the Unicode object."


External links


The Unicode Standard, Version 4.0 - online edition





Multibyte (3) Man Page @ FreeBSD.org

Multibyte and Wide Characters @ Microsoft Developer Network

Windows Character Sets @ Microsoft Developer Network

Unicode and Character Set Programming Reference @ Microsoft Developer Network

Keep multibyte character support simple @ EuroBSDCon, Beograd, September 25, 2016