Tamil All Character Encoding (TACE16) is a scheme for
encoding
the
Tamil script
in the
Private Use Area
of
Unicode
, implementing a
syllabary
-based character model differing from the modified-
ISCII
model used by
Unicode's existing Tamil implementation.
Keyboard drivers and fonts
The keyboard driver for this encoding scheme is available on the
Tamil Virtual Academy
website for free.
[Tamil Nadu Government's Order(G.O.), Keyboard Drivers and Fonts](_blank)
It uses
Tamil 99 and Tamil Typewriter
keyboard layout
s, which are approved by the
Government of Tamil Nadu
, and maps the input keystrokes to their corresponding characters in the TACE16 scheme.
To read files created using TACE16, the corresponding Unicode Tamil fonts are also available on the same website.
These fonts map glyphs not only for TACE16 characters but also for the
Unicode block
for both
ASCII
and
Tamil characters, so that they can provide
backward compatibility
for reading existing files which are created using the
Tamil Unicode block.
Character set
All the characters of this encoding scheme are located in the
private use area
of the
Basic Multilingual Plane
of
Unicode
's
Universal Coded Character Set
.
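As a minimal illustration (not taken from the TACE16 specification), the Private Use Area of the Basic Multilingual Plane spans U+E000 to U+F8FF, so a simple range check shows whether a code point could belong to a PUA-based scheme such as TACE16. The helper name `is_bmp_private_use` is made up for this sketch.

```python
import unicodedata

BMP_PUA_START = 0xE000  # first code point of the BMP Private Use Area
BMP_PUA_END = 0xF8FF    # last code point of the BMP Private Use Area

def is_bmp_private_use(ch):
    """Return True if the single character ch lies in the BMP Private Use Area."""
    return BMP_PUA_START <= ord(ch) <= BMP_PUA_END

print(is_bmp_private_use("\ue000"))    # a private-use code point
print(is_bmp_private_use("\u0b95"))    # TAMIL LETTER KA from the standard Tamil block
print(unicodedata.category("\ue000"))  # general category "Co" marks private use
```

A range check like this only says that a code point is private-use; it cannot say which private agreement (TACE16 or any other) gives it meaning.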
Comparison of TACE16 to present Tamil Unicode
Criticism of the standard Unicode character model for Tamil
The
existing Unicode character model for Tamil is, like most of
Indic Unicode, an
abugida
-based model derived from
ISCII
. It has been criticized for several reasons.
Unicode represents only 31 Tamil base characters as single
code point
s, out of 247
grapheme cluster
s. These include stand-alone vowels, and 23 basic consonant glyphs (which, due to not bearing a
virama
, nonetheless denote a syllable with both a consonant and a vowel when used on their own). The others are represented as sequences of code points, requiring software support for advanced typography features (such as
Apple Advanced Typography
,
Graphite
, or
OpenType advanced typography) to render correctly. This also requires the use of invisible
zero-width joiner
and
zero-width non-joiner
characters in places where the desired grapheme cluster would otherwise be ambiguous. This complexity can result in security vulnerabilities and ambiguous combinations, can require the use of an exception table to forbid invalid combinations of code points, and can necessitate the use of
string normalization to compare two
strings for equality.
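The need for normalization can be sketched with Python's standard `unicodedata` module. In Unicode Tamil, the vowel sign O (U+0BCA) is canonically equivalent to the two-code-point sequence U+0BC6 U+0BBE, so the same syllable can be stored in two different ways that only compare equal after normalization. The helper `equal_after_nfc` is a name invented for this example.

```python
import unicodedata

KA = "\u0b95"       # TAMIL LETTER KA
SIGN_O = "\u0bca"   # TAMIL VOWEL SIGN O (a single code point)
SIGN_E = "\u0bc6"   # TAMIL VOWEL SIGN E
SIGN_AA = "\u0bbe"  # TAMIL VOWEL SIGN AA

composed = KA + SIGN_O              # two code points
decomposed = KA + SIGN_E + SIGN_AA  # three code points, same syllable

def equal_after_nfc(a, b):
    """Compare two strings after bringing both to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)                 # raw comparison of code points
print(equal_after_nfc(composed, decomposed))  # comparison after normalization
```

The raw comparison fails because the strings differ in length, while the normalized comparison succeeds.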
Additionally, since syllables with both a consonant and a vowel form 64 to 70% of Tamil text, an abugida-based model which encodes the consonant and vowel parts as separate code points is inefficient, in terms of how long a string needs to be to contain a given piece of text, in comparison with a syllabary-based model.
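The length difference can be made concrete with a rough sketch. It assumes that a syllabary-based encoding would spend one code point per written syllable, and it approximates the number of syllables by counting the characters that are not combining marks; `count_units` is a hypothetical helper, not part of any TACE16 tooling.

```python
import unicodedata

def count_units(text):
    """Return (code points, base letters) for a Unicode Tamil string.

    Base letters are characters whose general category is not a mark
    (Mn/Mc); each one starts a written syllable in this approximation.
    """
    code_points = len(text)
    base_letters = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return code_points, base_letters

# The word "Tamil" in Tamil script: TA, MA, VOWEL SIGN I, LLLA, VIRAMA
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
print(count_units(word))              # (5, 3)
print(len(word.encode("utf-16-le")))  # 10 bytes in UTF-16
```

Under the stated assumption, the same word would need three 16-bit units (6 bytes) in a syllabary-based encoding instead of five (10 bytes).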
Furthermore, ISCII is primarily an encoding of
Devanagari
, and the ISCII encodings of other
Brahmic scripts
(including Tamil) encode characters over the code points of the corresponding characters in Devanagari ISCII. Although Unicode encodes the Brahmic scripts separately from one another, the Tamil block mirrors the ISCII layout (with Devanagari-style character ordering, and reserved space in positions corresponding to Devanagari characters with no Tamil equivalent); consequently, the characters are not in the natural sequence order, and strings
collated
by code point (analogous to "
ASCIIbetical" sorting of English text) will not come out in the expected order. Arranging them in the natural order
requires a more complex collation algorithm.
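The ordering problem can be seen with single consonant letters. The sketch below compares a plain code-point sort with a sort keyed on the traditional order of the 18 Tamil consonants. It is a simplified illustration only: real collation must also handle vowels, vowel signs, the pulli and Grantha letters, and the `TRADITIONAL` table and `RANK` lookup are written for this example rather than taken from any library.

```python
# Traditional order of the 18 Tamil consonants, given as code points:
# ka nga ca nya tta nna ta na pa ma ya ra la va llla lla rra nnna
TRADITIONAL = [
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,
]
RANK = {chr(cp): i for i, cp in enumerate(TRADITIONAL)}

letters = ["\u0ba9", "\u0baa", "\u0bb4", "\u0bb2"]  # nnna, pa, llla, la

by_code_point = sorted(letters)                       # order of the Unicode block
by_tradition = sorted(letters, key=RANK.__getitem__)  # traditional order

print([f"U+{ord(c):04X}" for c in by_code_point])
print([f"U+{ord(c):04X}" for c in by_tradition])
```

The letter NNNA (U+0BA9) sorts first by code point but last in the traditional order, because the block places it next to NA, mirroring the ISCII layout.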
TACE16 in comparison
The following data provides a comparison of current Unicode Tamil vs. TACE16 on e-governance and browsing:
* TACE16 is more efficient than Unicode Tamil by about 5.46 to 11.94 percent for
data storage
.
* TACE16 is more efficient than Unicode Tamil by about 18.69 to 22.99 percent for sorting index data.
* TACE16 is more efficient than Unicode Tamil by about 25.39% when the entire data is Tamil. The default (binary) collation sequence obtained from the code-space values in TACE16 does not follow Tamil dictionary order.
* TACE16 sorts faster than Unicode Tamil by about 0.31 to 16.96 percent.
* Index creation on TACE16 data is 36.7% faster than on Unicode Tamil data.
* For full key search on indexed fields, TACE16 performs better than Unicode Tamil by up to 24.07%. In the case of non-indexed fields, TACE16 performs better than Unicode Tamil by up to 20.9%.
* Rendering of static Tamil data works with TACE16.
TACE16 provides improvements in both processing time and storage space. It encompasses all general Tamil text; it is sequential; and it is unambiguous, with any code point corresponding to only one character.
The TACE16 system takes fewer
instruction cycles than Unicode Tamil, and also allows programming based on Tamil grammar, which needs extra framework development in Unicode Tamil.
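The idea of grammar-based programming on a syllabary encoding can be sketched as follows. The layout used here is entirely hypothetical and is NOT the real TACE16 code chart: it simply assumes that each consonant owns a row of 16 consecutive private-use code points, with one slot per vowel. The names `HYPOTHETICAL_BASE`, `ROW_WIDTH`, `encode` and `decode` are invented for the illustration; actual code point values must be taken from the published TACE16 table.

```python
# HYPOTHETICAL layout for illustration only; NOT the real TACE16 code chart.
HYPOTHETICAL_BASE = 0xE000  # first code point of the BMP Private Use Area
ROW_WIDTH = 16              # assumed: one row of 16 slots per consonant

def encode(consonant_row, vowel_slot):
    """Map (consonant row, vowel slot) to a single private-use code point."""
    if consonant_row < 0 or not 0 <= vowel_slot < ROW_WIDTH:
        raise ValueError("row or slot out of range")
    return chr(HYPOTHETICAL_BASE + consonant_row * ROW_WIDTH + vowel_slot)

def decode(ch):
    """Recover (consonant row, vowel slot) using integer arithmetic only."""
    return divmod(ord(ch) - HYPOTHETICAL_BASE, ROW_WIDTH)

syllable = encode(consonant_row=3, vowel_slot=2)
print(len(syllable))     # one code point per syllable
print(decode(syllable))  # consonant and vowel recovered without lookup tables
```

With such a grid, splitting a syllable into its consonant and vowel is a single `divmod`, whereas Unicode Tamil needs cluster segmentation before the same question can be asked.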
Responses by the Unicode Consortium
The
Unicode Consortium
publishes a dedicated
FAQ
page on the Tamil script which responds to some of the criticisms. In defence of the ISCII model, the Consortium notes that expert
linguist
s,
typographer
s and programmers were involved in its development, but acknowledges that compromises were made due to ISCII being constrained to single-byte
extended ASCII
. The Consortium points out that Unicode Tamil is now implemented by all major
operating system
s and
web browser
s, and maintains that it should be used in open interchange contexts, such as online, since tools such as
search engine
s would not necessarily be able to identify or interpret a sequence of Unicode private-use code points as Tamil text. However, the Consortium does not object to the use of Private-Use Area schemes, including TACE16, internally to particular processes for which they are useful. In particular, it highlights that both
markup schemes and alternative encoding schemes may be used by researchers for specialised purposes such as
natural-language processing.
Unicode defines normative named sequences for all Tamil pure consonants and syllables that are represented by sequences of more than one code point, and a dedicated table published as part of the Unicode Standard lists all of these sequences, in their traditional order, along with their correct glyphs. The Consortium points out that it has been open to accepting proposals for characters for which no existing Unicode representation exists: for example, adding several historical fractions and other symbols as the
Tamil Supplement
block in version 12.0 in 2019.
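The named sequences mentioned above can be looked up programmatically. The sketch below assumes a Python version whose `unicodedata.lookup` supports named sequences (documented since Python 3.3) and a bundled Unicode database that includes the Tamil entries.

```python
import unicodedata

# Named sequences resolve to strings of more than one code point.
kaa = unicodedata.lookup("TAMIL SYLLABLE KAA")    # KA + VOWEL SIGN AA
pure_k = unicodedata.lookup("TAMIL CONSONANT K")  # KA + VIRAMA (pulli)

for label, seq in (("KAA", kaa), ("K", pure_k)):
    print(label, [f"U+{ord(c):04X}" for c in seq])
```

Each name identifies a whole syllable or pure consonant, even though the underlying text still stores it as a sequence of code points.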
Regarding collation, the Consortium argues that obtaining the correct result from sorting by code point is the exception rather than the rule, highlighting that, in unmodified
ASCIIbetical ordering, the uppercase Latin letter ''Z'' sorts before the lowercase letter ''a'', and also highlighting that collation rules often differ by language (see e.g.
ö). Regarding space efficiency, the Consortium argues that storage space and bandwidth taken up by text is usually far overshadowed by other accompanying media such as images and video, and that text content performs well under general-purpose compression methods such as
Deflate (originally from the
ZIP file format
, standardized in RFC 1951 and integrated in the HTTP protocol as a generic encoding scheme).
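Both of the Consortium's points can be checked with a short sketch using only the Python standard library. The first line shows that code-point ordering puts uppercase ''Z'' before lowercase ''a''. The rest compresses a deliberately repetitive Tamil sample with `zlib` (a Deflate stream in a zlib wrapper); because the sample is artificial, the ratio it prints overstates what ordinary prose would achieve.

```python
import zlib

# Code-point ("ASCIIbetical") order: uppercase letters precede lowercase ones.
print(sorted(["a", "Z"]))

# A repetitive sample: the word "Tamil" in Tamil script plus a space, 200 times.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd " * 200
raw = sample.encode("utf-8")  # 3 bytes per Tamil code point in UTF-8
packed = zlib.compress(raw)   # Deflate-compressed data

print(len(raw), len(packed))
```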
Unicode Stability Policy
When first published (version 1.0.0), Unicode made only limited stability guarantees. As such, the
original Tibetan block was deleted in version 1.0.1 (and its space has since been occupied by the
Myanmar block), and the
original block for Korean syllables was deleted in version 2.0 (and is now occupied by
CJK Unified Ideographs Extension A
). Both the current
Hangul Syllables block for Korean syllables, and the current
Tibetan block, date back to Unicode 2.0. This was done on the assumption that little or no existing content using Unicode for those writing systems existed,
since it would break compatibility with all existing Unicode content in, and
input method
s for, those writing systems. After this so-dubbed "Korean mess", the responsible committees pledged not to make such a compatibility-breaking change ever again,
which now forms part of the Unicode Stability Policy.
This stability policy has been upheld ever since, in spite of demands to re-encode or change the character model for both Tibetan and Korean a second time, made by
China
and
North Korea
respectively.
Likewise in relation to Tamil, the Consortium emphasises the "crucial issue of maintaining the stability of the standard for existing implementations", and argues that "the resulting costs and impact of destabilizing the standard" would substantially outweigh any efficiency benefits in processing speed or storage space.
A proposal to re-encode Tamil was rejected by the Unicode Consortium, which said that the re-encoding would be damaging and that there was no convincing evidence that the Unicode Tamil encoding is deficient.
Alternatives
Open-Tamil
The Open-Tamil project
provides many of the common text-processing operations. It claims Level-1 compliance for Tamil text processing without using TACE16, but relies on the extra programming logic that Unicode Tamil requires.
See also
*
Clip font
*
Tamil Script Code for Information Interchange
*
Tamil keyboard
*
தமிழ் 99
*
InScript
*
Tamil (Unicode block)
*
Tamil blogosphere
*
AnyTaFont2UTF8, an
open source
project for all Tamil encoding/font mapping characters.