__NOTOC__
The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Standard #10. It is a customizable method for producing binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort the strings according to the rules of the language, with options for ignoring case, accents, etc.
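The idea of a multi-level binary key can be illustrated with a minimal Python sketch. It is not an implementation of the UCA: the weights below are derived from code points rather than from a real collation element table, and the function name `toy_sort_key` and its `strength` parameter are hypothetical. It only shows how primary (base letter), secondary (accent) and tertiary (case) weights can be concatenated so that plain byte comparison yields a layered ordering.

```python
import unicodedata

# A zero weight separates levels; all real weights below are non-zero.
LEVEL_SEPARATOR = (0).to_bytes(3, "big")


def _pack(weights):
    """Encode each weight as 3 big-endian bytes so bytes compare like numbers."""
    return b"".join(w.to_bytes(3, "big") for w in weights)


def toy_sort_key(text, strength=3):
    """Build a toy key: strength 1 = letters, 2 = + accents, 3 = + case.

    Assumption: weights come from code points, not from the DUCET.
    """
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Combining mark: ignored at level 1, significant at level 2.
            secondary.append(ord(ch) + 1)
            tertiary.append(1)
        else:
            folded = ch.lower()
            primary.append(ord(folded[0]) + 1)
            secondary.append(1)                       # "no accent" weight
            tertiary.append(2 if ch != folded else 1)  # 2 marks uppercase
    levels = [primary, secondary, tertiary][:strength]
    return LEVEL_SEPARATOR.join(_pack(level) for level in levels)


words = ["Zebra", "apple", "Éclair", "eclair", "banana"]
ordered = sorted(words, key=toy_sort_key)
```

With `strength=1` the keys for "Éclair" and "eclair" are identical, which corresponds to ignoring both accents and case; with `strength=2` only case is ignored.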
Unicode Technical Standard #10 also specifies the ''Default Unicode Collation Element Table'' (DUCET), a data file that defines a default collation ordering. The DUCET can be customized (tailored) for different languages, and some such customizations can be found in the Unicode Common Locale Data Repository (CLDR).
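Tailoring can be pictured as a small set of overrides applied on top of a default table. The following Python sketch is an illustration under stated assumptions, not the DUCET or a CLDR tailoring: the default weights are invented from code points, only the primary level is modelled, and the names `default_primary`, `SWEDISH_LIKE` and `primary_key` are hypothetical. The override mimics the Swedish convention of treating å, ä and ö as separate letters that follow z.

```python
import unicodedata


def default_primary(ch):
    """Toy default: an accented letter shares the weight of its base letter."""
    base = unicodedata.normalize("NFD", ch)[0].lower()
    return (ord(base), 0)


# Hypothetical tailoring: å, ä, ö become distinct letters sorted after z.
SWEDISH_LIKE = {
    "å": (ord("z"), 1),
    "ä": (ord("z"), 2),
    "ö": (ord("z"), 3),
}


def primary_key(text, tailoring=None):
    """Return a list of primary weights, preferring tailored entries."""
    tailoring = tailoring or {}
    composed = unicodedata.normalize("NFC", text)
    return [tailoring.get(ch.lower(), default_primary(ch)) for ch in composed]


words = ["zebra", "ärlig", "apa", "öl"]
default_order = sorted(words, key=primary_key)
tailored_order = sorted(words, key=lambda w: primary_key(w, SWEDISH_LIKE))
```

In the untailored ordering "ärlig" sorts among the words beginning with a, while the tailored ordering places it after "zebra".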
An open source implementation of the UCA is included with the International Components for Unicode (ICU). ICU supports tailoring, and the collation tailorings from CLDR are included in ICU.
See also
* Collation
* ISO/IEC 14651
* European ordering rules (EOR)
* Common Locale Data Repository (CLDR)
External links
Unicode Collation Algorithm Unicode Technical Standard #10
Mimer SQL Unicode Collation Charts
Tools
ICU Locale Explorer: An online demonstration of the Unicode Collation Algorithm using International Components for Unicode
An ICU collation demo
A sort program that provides an unusual level of flexibility in defining collations and extracting keys.
{{standard-stub}}