In
Unicode
Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Char ...
, a script is a collection of
letters and other written signs used to represent textual information in one or more
writing system
A writing system comprises a set of symbols, called a ''script'', as well as the rules by which the script represents a particular language. The earliest writing appeared during the late 4th millennium BC. Throughout history, each independen ...
s. Some scripts support only one writing system and
language
Language is a structured system of communication that consists of grammar and vocabulary. It is the primary means by which humans convey meaning, both in spoken and signed language, signed forms, and may also be conveyed through writing syste ...
, for example,
Armenian. Other scripts support many different writing systems; for example, the
Latin script
The Latin script, also known as the Roman script, is a writing system based on the letters of the classical Latin alphabet, derived from a form of the Greek alphabet which was in use in the ancient Greek city of Cumae in Magna Graecia. The Gree ...
supports
English,
French,
German,
Italian,
Vietnamese,
Latin
Latin ( or ) is a classical language belonging to the Italic languages, Italic branch of the Indo-European languages. Latin was originally spoken by the Latins (Italic tribe), Latins in Latium (now known as Lazio), the lower Tiber area aroun ...
itself, and several other languages. Some languages make use of multiple alternate writing systems and thus also use several scripts; for example, in
Turkish, the
Arabic
Arabic (, , or , ) is a Central Semitic languages, Central Semitic language of the Afroasiatic languages, Afroasiatic language family spoken primarily in the Arab world. The International Organization for Standardization (ISO) assigns lang ...
script was used before the 20th century but transitioned to Latin in the early part of the 20th century. More or less complementary to scripts are
symbols and Unicode
control characters.
The unified
diacritical characters and unified
punctuation characters frequently have the "common" or "inherited" script property. However, the individual scripts often have their own
punctuation
Punctuation marks are marks indicating how a piece of writing, written text should be read (silently or aloud) and, consequently, understood. The oldest known examples of punctuation marks were found in the Mesha Stele from the 9th century BC, c ...
and
diacritic
A diacritic (also diacritical mark, diacritical point, diacritical sign, or accent) is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek (, "distinguishing"), from (, "to distinguish"). The word ''diacrit ...
s, so that many scripts include not only letters but also diacritic and other marks, punctuation, numerals and even their own idiosyncratic symbols and
space
Space is a three-dimensional continuum containing positions and directions. In classical physics, physical space is often conceived in three linear dimensions. Modern physicists usually consider it, with time, to be part of a boundless ...
characters.
Unicode defines 168 separate scripts, including 99 modern scripts and 69 ancient or historic scripts. More scripts are in the process for encoding or have been tentatively allocated for encoding in roadmaps.
Definition and classification
When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However,
Swedish includes the character ''
ĂĄ'' (sometimes called a Swedish ''O''), while English has no such character. Nor does English make use of the diacritic ''
combining ring above'' for any character. In general, the languages sharing the same scripts share many of the same characters. Despite these peripheral differences in the Swedish and English writing systems, they are said to use the same Latin script. Thus, the Unicode abstraction of scripts is a basic organizing technique. The differences among different alphabets or writing systems remain and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.
Script versus writing system
''
Writing system
A writing system comprises a set of symbols, called a ''script'', as well as the rules by which the script represents a particular language. The earliest writing appeared during the late 4th millennium BC. Throughout history, each independen ...
'' is sometimes treated as a synonym for "script". However, it also can be used as the specific concrete writing system supported by a script. For example, the
Vietnamese writing system is supported by the Latin script. A writing system may also cover more than one script; for example, the Japanese writing system makes use of the
Han,
Hiragana and
Katakana scripts.
Most writing systems can be broadly divided into several categories: logographic, syllabic, alphabetic (or segmental), abugida, abjad and featural; however, all features of any of these may be found in any given writing system in varying proportions, often making it difficult to purely categorize a system. The term ''
complex system
A complex system is a system composed of many components that may interact with one another. Examples of complex systems are Earth's global climate, organisms, the human brain, infrastructure such as power grid, transportation or communication sy ...
'' is sometimes used to describe those where the admixture makes classification problematic.
Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text-processing algorithms.
Special script property values
In addition to explicit or specific script properties, Unicode uses three special values:
;Common: Unicode can assign a character in the
UCS to a single script only. However, many characters—those that are not part of a formal natural-language writing system or are unified across many writing systems—may be used in more than one script (for example, currency signs, symbols, numerals and punctuation marks). In these cases Unicode defines them as belonging to the "common" script (
ISO 15924 code "Zyyy").
;Inherited: Many diacritics and non-spacing combining characters may be applied to characters from more than one script. In these cases Unicode assigns them to the "inherited" script (ISO 15924 code Zinh), which means that they have the same script class as the base character with which they combine, and so in different contexts they may be treated as belonging to different scripts. For example, may combine either with to create a Latin ''Ă«'' or with for the
Cyrillic
The Cyrillic script ( ) is a writing system used for various languages across Eurasia. It is the designated national script in various Slavic, Turkic, Mongolic, Uralic, Caucasian and Iranic-speaking countries in Southeastern Europe, Ea ...
''Ń‘''. In the former case, it inherits the Latin script of the base character, whereas in the latter case, it inherits the Cyrillic script of the base character.
;Unknown: The value of "unknown" script (ISO 15924 code Zzzz) is given to unassigned, private-use, noncharacter, and surrogate code points.
Character categories within scripts
Unicode provides a general category property for each character. So in addition to belonging to a script every character also has a general category. Typically scripts include letter characters including: uppercase letters, lowercase letters and modifier letters. Some characters are considered titlecase letters for a few
precomposed ligatures such as Dz (U+01F2). Such titlecase ligatures are all in the Latin and Greek scripts and are all
compatibility characters, and therefore Unicode discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.
Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts all letters are categorized as "other letter" or "modifier letter". Ideographs such as Unihan ideographs are also categorized as "other letters". A few scripts do differentiate between uppercase and lowercase however: Latin, Cyrillic, Greek, Armenian, Georgian, and Deseret. Even for these scripts there are some letters that are neither uppercase nor lowercase.
Scripts can also contain any other general category character such as marks (diacritic and otherwise), numbers (numerals), punctuation, separators (word separators such as spaces), symbols and non-graphical format characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.
List of encoded scripts
, Unicode defines 168 scripts (called "Alias" or "Property value alias") based on the ISO 15924 list. In addition, Unicode assigns the name "Common" to ISO 15924's code for undetermined scripts, "Inherited" to ISO 15924's code for inherited scripts, and "Unknown" to ISO 15924's code for uncoded scripts. There are script codes defined by ISO 15924 but are not used in Unicode, including (Symbols) and (Mathematical notation).
Missing scripts in Unicode
The project Missing Scripts—with contributors from the
Mainz University of Applied Sciences, the L’Atelier national de recherche typographique (ANRT) in
Nancy, and the
University of California, Berkeley
The University of California, Berkeley (UC Berkeley, Berkeley, Cal, or California), is a Public university, public Land-grant university, land-grant research university in Berkeley, California, United States. Founded in 1868 and named after t ...
—has compiled a list of 131 scripts that have not yet been encoded in ''The Unicode Standard'', out of a total of 294 recognized scripts according to the current state of research.
See also
*
Latin script in Unicode
*
Unicode characters
*
Unicode symbols
*
Phonemic and phonetic orthography
References
External links
Script Encoding Initiative A project at UC Berkeley, USA, working to get more scripts included in the Unicode standard.
The World’s Writing Systems An overview of all 294 known writing systems, each with a typographic reference glyph and their Unicode status.
{{Writing systems
Scripts