Tamil All Character Encoding (TACE16) is a scheme for encoding the Tamil script in the Private Use Area of Unicode, implementing a syllabary-based character model differing from the modified-ISCII model used by Unicode's existing Tamil implementation.

Keyboard drivers and fonts

The keyboard driver for this encoding scheme is available for free on the Tamil Virtual Academy website (Tamil Nadu Government's Order (G.O.), Keyboard Drivers and Fonts). It uses the Tamil 99 and Tamil Typewriter keyboard layouts, which are approved by the Government of Tamil Nadu, and maps input keystrokes to the corresponding characters of the TACE16 scheme. To read files created using TACE16, the corresponding Unicode Tamil fonts are also available on the same website. These fonts map glyphs not only for the characters of the TACE16 format but also for the Unicode blocks for both ASCII and Tamil characters, so that they provide backward compatibility for reading existing files created using the Tamil Unicode block.

Character set

All the characters of this encoding scheme are located in the private use area of the Basic Multilingual Plane of the Universal Coded Character Set.
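As a minimal sketch of what placement in the private use area of the Basic Multilingual Plane means in practice, the check below tests whether a character falls in the range U+E000 to U+F8FF. The function name `is_bmp_private_use` is hypothetical, and the sketch makes no claim about which of these code points TACE16 actually assigns.

```python
import unicodedata

# Bounds of the Private Use Area in the Basic Multilingual Plane,
# as defined by the Unicode Standard.
BMP_PUA_FIRST = 0xE000
BMP_PUA_LAST = 0xF8FF


def is_bmp_private_use(ch: str) -> bool:
    """Return True if the single character ch lies in the BMP Private Use Area."""
    return BMP_PUA_FIRST <= ord(ch) <= BMP_PUA_LAST


if __name__ == "__main__":
    # Two PUA boundary characters, TAMIL LETTER KA, and a plane-15
    # private-use character that lies outside the BMP.
    samples = ["\ue000", "\uf8ff", "\u0b95", chr(0xF0000)]
    for ch in samples:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_bmp_private_use(ch))
```

Private-use characters carry the general category `Co` and have no standard name, which is why software cannot tell from the code point alone that such text is Tamil.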

Comparison of TACE16 to present Tamil Unicode

Criticism of the standard Unicode character model for Tamil

The existing Unicode character model for Tamil is, like most of Indic Unicode, an abugida-based model derived from ISCII. It has been criticized for several reasons.

Unicode represents only 31 Tamil base characters as single code points, out of 247 grapheme clusters. These include stand-alone vowels, and 23 basic consonant glyphs (which, due to not bearing a virama, nonetheless denote a syllable with both a consonant and a vowel when used on their own). The others are represented as sequences of code points, requiring software support for advanced typography features (such as Apple Advanced Typography, Graphite, or OpenType advanced typography) to render correctly. This also requires the use of invisible zero-width joiner and zero-width non-joiner characters in places where the desired grapheme cluster would otherwise be ambiguous. This complexity can result in security vulnerabilities and ambiguous combinations, can require the use of an exception table to forbid invalid combinations of code points, and can necessitate the use of string normalization to compare two strings for equality.

Additionally, since syllables with both a consonant and a vowel form 64 to 70% of Tamil text, an abugida-based model which encodes the consonant and vowel parts as separate code points is inefficient in comparison with a syllabary-based model, in terms of how long a string needs to be to contain a given piece of text.

Furthermore, ISCII is primarily an encoding of Devanagari, and the ISCII encodings of other Brahmic scripts (including Tamil) encode characters over the code points of the corresponding characters in Devanagari ISCII. Although Unicode encodes the Brahmic scripts separately from one another, the Tamil block mirrors the ISCII layout (with Devanagari-style character ordering, and reserved space in positions corresponding to Devanagari characters with no Tamil equivalent); consequently, the characters are not in the natural sequence order, and strings collated by code point (analogous to "ASCIIbetical" sorting of English text) will not produce the expected sorting order. A complex collation algorithm is required to arrange them in the natural order.
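Two of these criticisms can be illustrated with a short sketch using Python's standard `unicodedata` module: the need for normalization before comparing strings, and the mismatch between code point order and traditional order. The helper names `same_text` and `traditional_sort` are hypothetical, and the three-letter `TRADITIONAL` list is only a fragment chosen for illustration, not a full collation table.

```python
import unicodedata

KA = "\u0b95"       # TAMIL LETTER KA
SIGN_O = "\u0bca"   # TAMIL VOWEL SIGN O (one code point)
SIGN_E = "\u0bc6"   # TAMIL VOWEL SIGN E
SIGN_AA = "\u0bbe"  # TAMIL VOWEL SIGN AA

# The same syllable stored in two canonically equivalent ways.
ko_short = KA + SIGN_O            # 2 code points
ko_long = KA + SIGN_E + SIGN_AA   # 3 code points


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


PA = "\u0baa"    # TAMIL LETTER PA
MA = "\u0bae"    # TAMIL LETTER MA
NNNA = "\u0ba9"  # TAMIL LETTER NNNA, the last consonant in traditional order

# Fragment of the traditional consonant order (assumption: only these
# three letters are needed for the illustration).
TRADITIONAL = [PA, MA, NNNA]


def traditional_sort(letters):
    """Sort single letters by their position in the TRADITIONAL fragment."""
    return sorted(letters, key=TRADITIONAL.index)


if __name__ == "__main__":
    print(ko_short == ko_long, same_text(ko_short, ko_long))
    print([f"U+{ord(c):04X}" for c in sorted([PA, MA, NNNA])])
    print([f"U+{ord(c):04X}" for c in traditional_sort([NNNA, MA, PA])])
```

A plain `sorted()` call places U+0BA9 before U+0BAA because it compares code points, whereas the traditional order puts that letter last; real applications would normally rely on a full collation library rather than a hand-written table like this one.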

TACE16 in comparison

The following data compares current Unicode Tamil with TACE16 on e-governance and browsing:
* TACE16 is more efficient than Unicode Tamil by about 5.46 to 11.94 percent for data storage.
* TACE16 is more efficient than Unicode Tamil by about 18.69 to 22.99 percent for sorting index data.
* TACE16 is more efficient than Unicode Tamil by about 25.39% when the entire data is Tamil. The default collation sequence followed (binary) while using the code-space values in TACE16 is not as per Tamil dictionary order.
* TACE16 is faster than Unicode Tamil in sorting by about 0.31 to 16.96 percent.
* Index creation on TACE16 data is 36.7% faster than on Unicode data.
* For full key search on indexed fields, TACE16 performs better than Unicode Tamil by up to 24.07%. In the case of non-indexed fields, TACE16 performs better than Unicode Tamil by up to 20.9%.
* Rendering of static Tamil data works with TACE16.

TACE16 provides performance improvements in processing time and processing space. It encompasses all of the general Tamil text; it is sequential; and it is unambiguous, with any code point corresponding to only one character. The TACE16 system takes fewer instruction cycles than Unicode Tamil, and also allows programming based on Tamil grammar, which needs extra framework development in Unicode Tamil.
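The storage argument can be made concrete with a rough sketch that counts how many code points a Unicode Tamil string uses against how many syllable-sized units it contains. It assumes that a syllabary-style encoding would store each unit (a base letter plus any following combining marks) as a single 16-bit value; it does not use real TACE16 code assignments, ignores multi-consonant ligatures, and is not intended to reproduce the benchmark figures above. The name `count_units` is hypothetical.

```python
import unicodedata
from typing import Tuple


def count_units(text: str) -> Tuple[int, int]:
    """Return (code_points, syllable_units) for text.

    A syllable unit is approximated as one non-mark character together
    with any combining marks (general category M*) that follow it.
    """
    code_points = len(text)
    units = sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))
    return code_points, units


# The word "Tamil" in Tamil script: TA, MA + vowel sign I, LLLA + pulli.
WORD = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

if __name__ == "__main__":
    code_points, units = count_units(WORD)
    saving = 1 - units / code_points
    print(code_points, units, f"{saving:.0%}")
```

For this one word the sketch counts five code points against three units; savings on running text depend on how many syllables carry a vowel sign or pulli.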

Responses by the Unicode Consortium

The Unicode Consortium publishes a dedicated FAQ page on the Tamil script which responds to some of the criticisms.

In defence of the ISCII model, the Consortium notes that expert linguists, typographers and programmers were involved in its development, but acknowledges that compromises were made due to ISCII being constrained to single-byte extended ASCII. The Consortium points out that Unicode Tamil is now implemented by all major operating systems and web browsers, and maintains that it should be used in open interchange contexts, such as online, since tools such as search engines would not necessarily be able to identify or interpret a sequence of Unicode private-use code points as Tamil text. However, the Consortium does not object to the use of Private-Use Area schemes, including TACE16, internally to particular processes for which they are useful. In particular, it highlights that both markup schemes and alternative encoding schemes may be used by researchers for specialised purposes such as natural-language processing.

Unicode defines normative named sequences for all Tamil pure consonants and syllables which are represented with sequences of more than one code point, and a dedicated table is published as part of the Unicode Standard listing all of these sequences, in their traditional order, along with their correct glyphs. The Consortium points out that it has been open to accepting proposals for characters for which no existing Unicode representation exists: for example, adding several historical fractions and other symbols as the Tamil Supplement block in version 12.0 in 2019.

Regarding collation, the Consortium argues that obtaining the correct result from sorting by code point is the exception rather than the rule, highlighting that, in unmodified ASCIIbetical ordering, the uppercase Latin letter ''Z'' sorts before the lowercase letter ''a'', and also highlighting that collation rules often differ by language (see e.g. ö).

Regarding space efficiency, the Consortium argues that storage space and bandwidth taken up by text is usually far overshadowed by other accompanying media such as images and video, and that text content performs well under general-purpose compression methods such as Deflate (originally from the ZIP file format, standardized in RFC 1951 and integrated in the HTTP protocol as a generic encoding scheme).
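The compression argument can be tried out with Python's standard `zlib` module, which applies Deflate inside a small zlib wrapper. The sketch below is only an illustration under stated assumptions: the sample is a single word repeated many times, so it compresses far better than real prose would, and the variable names are arbitrary.

```python
import zlib

# The word "Tamil" in Tamil script plus a space, repeated to stand in
# for a longer document (assumption: real text is far less repetitive).
SAMPLE = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd " * 200

raw = SAMPLE.encode("utf-8")      # Unicode Tamil as UTF-8 bytes
packed = zlib.compress(raw, 9)    # Deflate at the highest compression level

if __name__ == "__main__":
    print(len(raw), len(packed))
    # Decompression restores the original bytes exactly.
    print(zlib.decompress(packed) == raw)
```

Measuring a representative corpus in the same way would show how much of the per-character overhead of multi-code-point syllables remains after general-purpose compression.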

Unicode Stability Policy

When first published (version 1.0.0), Unicode made only limited stability guarantees. As such, the original Tibetan block was deleted in version 1.0.1 (and its space has since been occupied by the Myanmar block), and the original block for Korean syllables was deleted in version 2.0 (and is now occupied by CJK Unified Ideographs Extension A). Both the current Hangul Syllables block for Korean syllables, and the current Tibetan block, date back to Unicode 2.0. This was done on the assumption that little or no existing content using Unicode for those writing systems existed, since it would break compatibility with all existing Unicode content in, and input methods for, those writing systems.

After this so-dubbed "Korean mess", the responsible committees pledged not to make such a compatibility-breaking change ever again, which now forms part of the Unicode Stability Policy. This stability policy has been upheld ever since, in spite of demands to re-encode or change the character model for both Tibetan and Korean a second time, made by China and North Korea respectively.

Likewise in relation to Tamil, the Consortium emphasises the "crucial issue of maintaining the stability of the standard for existing implementations", and argues that "the resulting costs and impact of destabilizing the standard" would substantially outweigh any efficiency benefits in processing speed or storage space. A proposal to re-encode Tamil was rejected by the Unicode Consortium, which said that the re-encoding would be damaging and that there was no convincing evidence that the Unicode Tamil encoding is deficient.

Alternatives

Open-Tamil

The Open-Tamil project provides many of the common Tamil text-processing operations. It claims Level-1 compliance of Tamil text processing without using TACE16, but it is built on the extra programming logic that Unicode Tamil requires.

