International Components for Unicode
   HOME

TheInfoList



OR:

International Components for Unicode (ICU) is an
open-source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized sof ...
project of mature C/
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
and
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
libraries for
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
support, software
internationalization In economics, internationalization or internationalisation is the process of increasing involvement of enterprises in international markets, although there is no agreed definition of internationalization. Internationalization is a crucial strateg ...
, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the
Unicode Consortium The Unicode Consortium (legally Unicode, Inc.) is a 501(c)(3) non-profit organization incorporated and based in Mountain View, California. Its primary purpose is to maintain and publish the Unicode Standard which was developed with the intenti ...
and sponsored, supported, and used by IBM and many other companies. ICU provides the following services:
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
text handling, full character properties, and
character set Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that ...
conversions; Unicode
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
s; full Unicode sets; character, word, and line boundaries; language-sensitive
collation Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office fili ...
and searching; normalization, upper and lowercase conversion, and script
transliteration Transliteration is a type of conversion of a text from one writing system, script to another that involves swapping Letter (alphabet), letters (thus ''wikt:trans-#Prefix, trans-'' + ''wikt:littera#Latin, liter-'') in predictable ways, such as ...
s; comprehensive locale data and resource bundle architecture via the
Common Locale Data Repository The Common Locale Data Repository Project, often abbreviated as CLDR, is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating syst ...
(CLDR); multiple
calendar A calendar is a system of organizing days. This is done by giving names to periods of time, typically days, weeks, months and years. A date is the designation of a single and specific day within such a system. A calendar is also a physi ...
s and
time zone A time zone is an area which observes a uniform standard time for legal, Commerce, commercial and social purposes. Time zones tend to follow the boundaries between Country, countries and their Administrative division, subdivisions instead of ...
s; and rule-based formatting and parsing of dates, times, numbers, currencies, and messages. ICU provided
complex text layout Complex text layout (CTL) or complex text rendering is the typesetting of writing systems in which the shape or positioning of a grapheme depends on its relation to other graphemes. The term is used in the field of software internationalizatio ...
service for Arabic, Hebrew, Indic, and Thai historically, but that was deprecated in version 54, and was completely removed in version 58 in favor of
HarfBuzz HarfBuzz (loose transliteration of Persian calque ''harf-bāz'', literally "open type") is a software development library for text shaping, which is the process of converting Unicode text to glyph indices and positions. The newer version, ''N ...
. ICU provides more extensive internationalization facilities than the standard libraries for C and C++.
Unicode 14.0 Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whi ...
is supported. ICU 67 supports
Unicode 13.0 Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...
and handles removal of Great Britain from the EU. ICU 64 supports
Unicode 12.0 Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...
, while ICU 64.2 added support for Unicode 12.1, i.e. the single new symbol for current Japanese
Reiwa is the current Japanese era name, era of Japan's official calendar. It began on 1 May 2019, the day on which Emperor Akihito's elder son, Naruhito, Enthronement of the Japanese emperor, ascended the throne as the 126th Emperor of Japan. The ...
era (but support for it has also been backported to older ICU versions down to ICU 4.8.2). ICU 58 (with Unicode 9.0 support) is the last version to support older platforms such as
Windows XP Windows XP is a major release of Microsoft's Windows NT operating system. It was released to manufacturing on August 24, 2001, and later to retail on October 25, 2001. It is a direct upgrade to its predecessors, Windows 2000 for high-end and ...
and
Windows Vista Windows Vista is a major release of the Windows NT operating system developed by Microsoft. It was the direct successor to Windows XP, which was released five years before, at the time being the longest time span between successive releases of ...
. Support for
AIX Aix or AIX may refer to: Computing * AIX, a line of IBM computer operating systems *An Alternate Index, for a Virtual Storage Access Method Key Sequenced Data Set * Athens Internet Exchange, a European Internet exchange point Places Belgi ...
,
Solaris Solaris may refer to: Arts and entertainment Literature, television and film * ''Solaris'' (novel), a 1961 science fiction novel by Stanisław Lem ** ''Solaris'' (1968 film), directed by Boris Nirenburg ** ''Solaris'' (1972 film), directed by ...
and
z/OS z/OS is a 64-bit operating system for IBM z/Architecture mainframes, introduced by IBM in October 2000. It derives from and is the successor to OS/390, which in turn was preceded by a string of MVS versions.Starting with the earliest: * O ...
may also be limited in later versions (i.e. building depends on compiler support). ICU has been included as a standard component with
Microsoft Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for serv ...
since
Windows 10 Windows 10 is a major release of Microsoft's Windows NT operating system. It is the direct successor to Windows 8.1, which was released nearly two years earlier. It was released to manufacturing on July 15, 2015, and later to retail on J ...
version 1703. ICU has historically used
UTF-16 UTF-16 (16-bit computing, 16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variab ...
, and still does only for Java; while for C/C++
UTF-8 UTF-8 is a variable-width encoding, variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit'' ...
is supported, including the correct handling of "illegal UTF-8". ICU 70 updated to Unicode 14 and added support for
emoji An emoji ( ; plural emoji or emojis) is a pictogram, logogram, ideogram or smiley embedded in text and used in electronic messages and web pages. The primary function of emoji is to fill in emotional cues otherwise missing from typed conversat ...
properties of strings and
C++20 C++20 is a version of the ISO/IEC 14882 standard for the C++ programming language. C++20 replaced the prior version of the C++ standard, called C++17. The standard was technically finalized by WG21 at the meeting in Prague in February 2020, a ...
compilers. ICU 71 adds e.g. phrase-based line breaking for Japanese (earlier methods didn't work well for short Japanese text, such as in titles and headings) and support for Hindi written in Latin letters (hi_Latn), also referred to as "
Hinglish Hinglish, a portmanteau of Hindi and English, is the macaronic hybrid use of English and languages of the Indian subcontinent, and especially Hindi. It involves code-switching or translanguaging between these languages whereby they are freely i ...
" and updates to the time zone data version 2022a.


Origin and development

After
Taligent Taligent Inc. (a portmanteau of "talent" and "intelligent") was an American software company. Based on the Pink object-oriented operating system conceived by Apple in 1988, Taligent Inc. was incorporated as an Apple/IBM partnership in 1992, and ...
became part of IBM in early 1996,
Sun Microsystems Sun Microsystems, Inc. (Sun for short) was an American technology company that sold computers, computer components, software, and information technology services and created the Java programming language, the Solaris operating system, ZFS, the ...
decided that the new Java language should have better support for internationalization. Since Taligent had experience with such technologies and were close geographically, their Text and International group were asked to contribute the international classes to the
Java Development Kit The Java Development Kit (JDK) is a distribution of Java Technology by Oracle Corporation. It implements the Java Language Specification (JLS) and the Java Virtual Machine Specification (JVMS) and provides the Standard Edition (SE) of the Java ...
as part of the
JDK The Java Development Kit (JDK) is a distribution of Java Technology by Oracle Corporation. It implements the Java Language Specification (JLS) and the Java Virtual Machine Specification (JVMS) and provides the Standard Edition (SE) of the Java ...
1.1 internationalization
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
s. A large portion of this code still exists in the and packages. Further internationalization features were added with each later release of Java. The Java internationalization classes were then ported to C++ and C as part of a library known as ICU4C ("ICU for C"). The ICU project also provides ICU4J ("ICU for Java"), which adds features not present in the standard Java libraries. ICU4C and ICU4J are very similar, though not identical; for example, ICU4C includes a Regular Expression API, while ICU4J does not. Both frameworks have been enhanced over time to support new facilities and new features of Unicode and
Common Locale Data Repository The Common Locale Data Repository Project, often abbreviated as CLDR, is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating syst ...
(CLDR). ICU was released as an open-source project in 1999 under the name IBM Classes for Unicode. It was later renamed to International Components For Unicode. In May, 2016, the ICU project joined the Unicode consortium as technical committee ''ICU-TC'', and the library sources are now distributed under the Unicode license.


MessageFormat

A part of ICU is the MessageFormat class, a formatting system that allows for any number of arguments to control the plural form (, ) or more general switch-case-style selection () for things like
grammatical gender In linguistics, grammatical gender system is a specific form of noun class system, where nouns are assigned with gender categories that are often not related to their real-world qualities. In languages with grammatical gender, most or all nouns ...
. These statements can be nested. ICU MessageFormat was created by adding the plural and selection system to an identically-named system in
Java SE Java Platform, Standard Edition (Java SE) is a computing platform for development and deployment of portable code for desktop and server environments. Java SE was formerly known as Java 2 Platform, Standard Edition (J2SE). The platform uses Ja ...
.


Alternatives

An alternative for using ICU with
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
, or to using it directly, is to use Boost.Locale, which is a C++ wrapper for ICU (while also allowing other backends). The claim for using it rather than ICU directly is that "is absolutely unfriendly to C++ developers. It ignores popular C++ idioms (the STL, RTTI, exceptions, etc), instead mostly mimicking the Java API." Another claim, that ICU only supports UTF-16 (and thus a reason to avoid using ICU) is no longer true with ICU now also supporting UTF-8 for C and C++.


See also

*
Apple Advanced Typography Apple Advanced Typography (AAT) is Apple Inc.'s computer technology for advanced font rendering, supporting internationalization and complex features for typographers, a successor to Apple's little-used QuickDraw GX font technology of the mid-1 ...
*
Apple Type Services for Unicode Imaging The Apple Type Services for Unicode Imaging (ATSUI) is the set of services for rendering Unicode-encoded text introduced in Mac OS 8.5 and carried forward into Mac OS X. It replaced the WorldScript engine for legacy encodings. Obsolescence ...
*
GNU GetText In computing, gettext is an internationalization and localization (i18n and l10n) system commonly used for writing multilingual programs on Unix-like computer operating systems. One of the main benefits of gettext is that it separates programmi ...
*
Graphite (SIL) Graphite is a programmable Unicode-compliant smart font technology and rendering system developed by SIL International as free software, distributed under the terms of the GNU Lesser General Public License and the Common Public License. Capabil ...
*
NetRexx NetRexx is an open source, originally IBM's, variant of the REXX programming language to run on the Java virtual machine. It supports a classic REXX syntax, with no reserved keywords, along with considerable additions to support object-oriented ...
(ICU license) *
OpenType OpenType is a format for scalable computer fonts. It was built on its predecessor TrueType, retaining TrueType's basic structure and adding many intricate data structures for prescribing typographic behavior. OpenType is a registered trademark ...
*
Pango Pango (stylized as Παν語) is a text (i.e. glyph) layout engine library which works with the HarfBuzz shaping engine for displaying multi-language text. Full-function rendering of text and cross-platform support is achieved when Pango is us ...
*
Uconv In computing, uconv is a command-line tool that is bundled with International Components for Unicode that converts text files between different character encodings. It is very similar to the ''iconv'' command that is part of the Single UNIX Specifi ...
*
Uniscribe Uniscribe is the Microsoft Windows set of services for rendering Unicode-encoded text, supporting complex text layout. It is implemented in the dynamic link library . Uniscribe has been released with Windows 2000 and Internet Explorer 5.0. In addi ...


References


External links

*
International Components for Unicode transliteration services

Online ICU editor
{{Unicode navigation Unicode Component-based software engineering Digital typography Pattern matching Internationalization and localization Free computer libraries