The following is a DRAFT of a proposal for comments by the internet tamil community. Early in 1998, this draft will be finalised and the proposal presented to the Tamilnadu Computer Standardisation Committee (TNC) for possible adoption as a Standard.


A Proposal for 
A Tamil Standard Code For Information Interchange
(TSCII)



Tamil

Tamil is one of the two classical languages of India. It is the only language in that country which has continued to exist for over two thousand years. It is spoken today by approximately 65 million people living mainly in southern India, Sri Lanka, Singapore, Malaysia, Africa, Fiji, the West Indies, Mauritius and Reunion Islands, United Kingdom, United States and Canada. Tamil is the pre-eminent member of the Dravidian Language family and has one of the longest unbroken literary traditions of any living language in the world. [1]

Information Processing in Tamil

Dravidian languages such as Tamil use non-roman scripts. Hence typing text in these Indic languages on computers requires specific font faces and/or word-processing software. In spite of this limitation, word-processing of tamil text on computers has been taking place for well over a decade. Many different fonts and packages have been developed. With the availability of free tamil fonts on the internet during the last two years, there has been a phenomenal growth in the number of web sites dealing with matters of interest to Tamils at large. There are already a number of tamil language newspapers and many popular and literary magazines available "on-line" in tamil script. There are web sites devoted to collections of electronic texts of tamil literary classics, language learning etc. in tamil script. A comprehensive listing of over 350 web sites of interest to Tamils is also available on the internet [2]. Currently nearly all tamil computing is at the word-processing level. We do not yet have dedicated software for other applications such as databases and multimedia.
In the absence of any organised effort to co-ordinate and promote tamil information processing at the national and international level, many different fonts and desktop publishing packages have been developed in different parts of the world. Hardly any standard protocols were followed in the development of these key tools for tamil information processing. This in turn has led to the present (rather unfortunate) situation in which one needs to download and install several tamil fonts or packages to be able to access most of the materials of interest to tamils available on the internet. An International Conference devoted to Tamil Information Processing (TamilNet'97) was organised early this year at Singapore by the Internet Resources and Development Unit (IRDU) of the National University of Singapore to discuss the situation and propose possible standards. This conference was unique in being the first to address computing issues in Tamil. Some of the key papers presented at this conference [3] (including a broad overview of the features of different fonts and DTP packages currently in use [4]) are available on the internet.

Recent Efforts Towards Standardisation

Recently there have been three lines of effort, all directed towards standardisation of tamil information processing on the internet. Firstly, there have been a couple of national and international conferences on this topic, including the TamilNet'97 mentioned earlier. Secondly, the Tamil Nadu Government recently set up an expert committee/task force ("The Tamilnadu Computer Standardisation Committee") to examine the situation and make recommendations. This committee has made its first recommendations for tamil keyboard layouts [5].
Thirdly, the Tamil language has been fortunate to have two major email discussion lists on the internet: one operated by Asia-Pacific Internet Company (APIC) called tamilnet [6] (tamil@tamil.net) and one operated by the IRDU unit of the National University of Singapore called TamilWeb (tamilnet@tamilnews.org.sg) [7]. For over a year, tamil lovers from different parts of the world have been discussing on these email lists the urgent need for an international standard for tamil computing. Participants in these discussions come from different walks of life (software developers, university academics and ordinary end-users). Recently the mailing lists were merged into a single one. The proposal for a new international standard for tamil information interchange discussed in these web pages is the outcome of these deliberations (an exchange of several hundred emails amongst several hundred participants over several months!).

Current Standards for Tamil

Before we elaborate on the proposed standard, it is pertinent to review the current standards, if any. In the early eighties, the Dept. of Electronics (DOE) of the Govt. of India appointed an expert committee to set up standards for information processing of indic languages. The Indian Standard Code for Information Interchange (ISCII), first launched in 1984, is the outcome of this exercise. ISCII is an 8-bit umbrella standard, defined in such a way that all indian languages can be treated using one single character encoding scheme. ISCII is a bilingual character encoding (not glyph encoding!) scheme. Roman characters and punctuation marks as defined in the standard lower ASCII take up the first half of the character set (the first 128 slots). Characters for indic languages are allocated to the upper slots (128-255). The Indian Standard ISCII-84 was subsequently revised in 1991 (ISCII-91). It is currently undergoing revision, possibly leading to ISCII-97. Along with the character encoding scheme (ISCII), the Govt. of India also defined a keyboard layout for input called INSCRIPT. The research and development wing of the DOE, Govt. of India (the Center for Development of Advanced Computing, CDAC, based in Pune, India) has developed software packages based on these indian standards. Its multilingual and multimedia products are based on Graphics and Intelligence-based Script Technology (GIST) (Email: gist@cdac.ernet.in). Commercial DTP packages based on ISCII are also available.
UNICODE [8] is a rapidly evolving international standard for multi-lingual word-processing. Unicode is a more ambitious 16-bit character encoding scheme defining over 65,000 slots for 50+ world languages. Along with other indic languages, Tamil has been assigned specific slots U+0B80 -> U+0BFF (which, in decimal, is 2944 -> 3071; 128 locations) in this multi-lingual standard [9]. For obvious reasons, the choice of characters in UNICODE for indic languages is based on the indian standard code ISCII. Microsoft has already implemented Unicode in its Windows 95/NT OS and even distributes a unicode font free for multi-lingual word-processing [10]. These fonts do NOT yet include any glyphs for the indic language segments. Apple has recently released a multi-lingual package for the indian market based on ISCII [11], but this package does NOT yet include the glyphs corresponding to Tamil.
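The hexadecimal-to-decimal conversion quoted above for the Tamil block can be checked with a few lines of Python. This is only an illustrative sketch; the constant and function names are chosen for this example and are not part of any standard.

```python
# Illustrative check of the Unicode Tamil block boundaries quoted in the text.
# The names below are made up for this example.
TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF


def in_tamil_block(ch: str) -> bool:
    """Return True if the single character ch lies in U+0B80..U+0BFF."""
    return TAMIL_BLOCK_START <= ord(ch) <= TAMIL_BLOCK_END


# Number of slots in the block, counting both end points.
block_size = TAMIL_BLOCK_END - TAMIL_BLOCK_START + 1

print(TAMIL_BLOCK_START, TAMIL_BLOCK_END, block_size)
print(in_tamil_block("\u0b85"), in_tamil_block("A"))
```

The first line printed gives the decimal start, end and size of the block; the second shows that the Tamil letter at U+0B85 falls inside the block while the roman letter "A" does not.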

Need for the Proposed Standard for Tamil

If ISCII and UNICODE standards already exist for information interchange of indic languages (including tamil), a natural question is why propose another standard for tamil. Listed below are some key arguments advanced in this context:
i) Both ISCII and UNICODE emphasise character encoding and leave the screen rendering of these characters to software developers. Dravidian languages are notorious for their complex glyph structures. Practically all of the current implementations of the ISCII and Unicode standards invoke modern font-handling techniques (such as glyph substitution) that are available only in state-of-the-art computers running under the latest versions of the OS (two of the most widely used platforms - Windows and Macintosh). Consequently these DTP packages are very expensive. A layman / simple tamil user is precluded from doing any simple word-processing of tamil texts on earlier generation computers.
ii) The necessity to go for advanced font handling techniques such as glyph substitution further puts us at a disadvantage, as we will have to wait for applications (DTP, Word Processing etc.) to be developed from scratch for Tamil and we may not enjoy the luxury of using off-the-shelf applications that were developed for English *as-is* in Tamil.
iii) Using the Devanagari script as the reference, ISCII defines a certain encoding scheme for all indic languages including the dravidian languages such as tamil, telugu and malayalam. Many of the scholars of the dravidian languages are highly critical of this approach. The phonology and the script usage of dravidian languages are very different. There are many characters in Tamil and Malayalam for which there are no Devanagari equivalents. Compromises are made by allocating extra slots to introduce these additional characters. By treating all indian scripts under one scheme, the ISCII philosophy does not take advantage of the fact that Tamil *can* be encoded in a simple form that seamlessly integrates with existing computing platforms without requiring specialised rendering technologies.
iv) ISCII and UNICODE are NOT the only avenues open for tamil information interchange. It is worth pointing out that these are evolving standards. Before their emergence, information processing and exchange in major languages of the world had been going on for several decades via simple, self-standing 7- and 8-bit fonts. The only problem with the existing tamil fonts is that no standard encoding scheme has been used. So the exchange of tamil text files is not simple, and one needs to use converters to go from one scheme to another. Web (read World-Wide-Web) based information exchange is fast growing as the rapid, cost-effective means of data exchange across the world. A standard encoding scheme for these tamil fonts can simplify the exchange enormously. European languages, for example, have been fortunate to have several character-encoding standards defined and universally implemented.
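To illustrate what a converter between two 8-bit font encodings involves, here is a minimal Python sketch. The slot numbers in HYPOTHETICAL_FONT_A_TO_B are invented purely for illustration and do not describe any real tamil font; the sketch also assumes a simple one-to-one remapping of upper-half slots, which real font pairs may not satisfy.

```python
# Minimal sketch of a byte-level converter between two 8-bit font encodings.
# ASSUMPTION: the mapping is one-to-one and touches only slots 128-255.
# The slot numbers below are hypothetical, not taken from any real font.
HYPOTHETICAL_FONT_A_TO_B = {0xC1: 0xAB, 0xC2: 0xAC}


def build_table(mapping: dict) -> bytes:
    """Build a 256-entry translation table; unmapped slots stay unchanged."""
    table = bytearray(range(256))
    for src, dst in mapping.items():
        if src < 128 or dst < 128:
            raise ValueError("only upper-half slots (128-255) may be remapped")
        table[src] = dst
    return bytes(table)


def convert(data: bytes, table: bytes) -> bytes:
    """Re-encode data byte by byte using the translation table."""
    return data.translate(table)


sample = b"abc " + bytes([0xC1, 0xC2])
converted = convert(sample, build_table(HYPOTHETICAL_FONT_A_TO_B))
print(converted)
```

Every pair of non-standard fonts needs its own table of this kind, which is the overhead a single agreed encoding would remove.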
There are several advantages to developing a tamil standard for information interchange that is based on simple, self-standing fonts:
i) Once installed in the system, such fonts can be used directly in practically all applications without any extra software/hardware intervention.
ii) Fonts developed for one encoding scheme can be ported to other computer platforms (particularly between Windows, Macintosh and Unix) in a rather straightforward manner. The task is so routine and simple that a growing number of fonts are being made available FREE on the internet, even by amateurs.
iii) World-wide, FREE distribution of a self-standing tamil font will lead to very rapid standardisation of information interchange, as has been the case with most of the European languages, Russian and Japanese. Until recently (when free tamil fonts appeared on the internet), tamil word-processing required the purchase of a tamil font for at least US$50 (much higher for DTP packages). No language can flourish in the emerging computer era unless the basic fonts required for routine tasks are either freely available or come as part of the computer system software.

Design Goals of the Proposed Standard

1. Establish a consistent International Tamil character encoding standard that in turn leads to a self-standing Tamil font usable on all computer platforms, particularly on earlier models/operating systems (covering at least those that appeared within the last decade).
 

2. The encoding is at the 8-bit bilingual level, using a unique set of glyphs and the usual lower ASCII set (roman letters with standard punctuation marks) occupying the first 128 slots.

3. The encoding scheme should be universal in scope. That is, the standard must include all characters that are likely to be used in everyday Tamil text interchange.

4. The encoding standard must be UNICODE and ISCII compatible.

5. The standard may include a private use area, which may be used to assign codes to characters not included.

6. The standard must be usable and co-exist with other existing software until Unicode compliant software becomes available.

7. The tamil standard must be in the public domain.

8. The tamil encoding standard should be such as to allow rapid implementation of many of the routine tasks required in large databases (such as search or sort).

9. The encoding standard should be such as to meet the special requirements of various types of applications.

10. The output of the tamil standard (tamil text) should be independent of the input mode.

11. As with the Unicode standard, "the proposed standard does not encode idiosyncratic, personal, novel, rarely exchanged, or private-use characters, nor does it encode logos or graphics. Artificial entities, whose sole function is to serve transiently in the input of text, are excluded. Graphologies unrelated to text, such as musical and dance notations, are outside the scope of the .. standard."

Proposed Standard Code for Tamil

The proposed tamil standard is an 8-bit bilingual scheme with the standard roman characters and punctuation marks in the first 128 slots (as in the lower ASCII chart). The tamil glyphs, along with a handful of grantha characters and special characters, are placed in the upper-ASCII part (slot positions 128-255). Table 1 (see annex-1) presents a complete listing of the various glyph choices with their code assignments. The same information is also presented in the form of a compact .gif file (Figure 1). The motivation for the specific glyph choices and the slot allocations is elaborated in the next section (design goals for the proposed standard).

Glyph choices for Slot positions 0-127 /rows 0-7:

Roman characters and punctuation marks - glyph choices identical to those in standard lower ASCII code / 8859-1 (Latin-1) schemes

Glyph choices for Slot positions 128-255 /rows 8-15:
i) entire set of vowels (uyirs) (12) and consonants (meis) (18)
ii) entire set of akaram-eRRiya meis (18), ikara varisai (18) and ukara varisai (18) and uukara varisai (18)
iii) consonant-modifiers for aakara, ikara, iikara, ekara and Ekara varisai (5); consonant-modifiers for the ukara and uukara varisai for the grantha characters (2)
iv) grantha characters ja, sha, sa, ha, ksha (5 vowel form and also the corresponding akara varisai (5))
v) special characters: copyright sign (#169), registered sign (#174) and bullet (#183) at their respective ANSI code positions shown within parenthesis. (Most of the punctuation marks required for presentation of newspapers and magazines on WWW are available in the standard lower ASCII set. Two missing ones are the copyright and registered mark signs. We included them here at their respective ANSI code positions to avoid the need to invoke additional font face tags in the HTML files.)
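The division of the 256 slots into rows and halves described above can be sketched in a few lines of Python. This is an illustration only: the function name and the labels it returns are made up for this example, and the sketch assumes the usual chart layout of 16 slots per row.

```python
# Illustrative sketch: locate an 8-bit slot in a 16 x 16 code chart.
# ASSUMPTION: 16 slots per row, so rows 0-7 hold slots 0-127 and
# rows 8-15 hold slots 128-255. Names and labels are hypothetical.
def slot_info(code: int) -> tuple:
    """Return (row, column, half) for a slot number in 0..255."""
    if not 0 <= code <= 255:
        raise ValueError("slot must be in 0..255")
    row, col = divmod(code, 16)
    half = "roman (lower ASCII)" if code < 128 else "tamil (upper half)"
    return row, col, half


for code in (65, 127, 128, 169, 255):
    print(code, slot_info(code))
```

Slot 169 (the copyright sign), for instance, falls in row 10, column 9 of the upper half.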

Notes on Implementation of 8-bit encoding scheme

What are the procedures to be followed in design of fonts so that there is no loss of integrity of files when they are transported across different computer platforms?
Amongst the character sets defined for world languages, the most extensively used standards are those corresponding to the ISO 8859-X schemes, X=1,2,3,4,5 (these are also known as Latin-1, Latin-2, Latin-3, Latin-4 and Latin-5 schemes). Due to fears of bit-stripping, the 8859-X schemes do not use rows 8 and 9. Till the early nineties, bit-stripping was a common problem. (When bit-stripping occurs, the eighth bit of each byte is dropped, so a character at slot #163 gets replaced by the one at #35 (= 163 - 128).) Now most of the communication protocols (particularly email (SMTP) and the Web (HTML on Netscape or Internet Explorer)) fully implement 8-bit transfers.
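The effect of bit-stripping on a text file can be simulated with the short Python sketch below. Stripping clears the eighth (top) bit of each byte, which amounts to subtracting 128 from any code in the range 128-255. The function names are chosen for this illustration only.

```python
# Illustrative simulation of bit-stripping on a 7-bit channel.
# Function names are made up for this example.
def strip_high_bit(data: bytes) -> bytes:
    """Clear the eighth bit of every byte, as a 7-bit channel would."""
    return bytes(b & 0x7F for b in data)


def is_seven_bit_safe(data: bytes) -> bool:
    """Return True if no byte uses the upper half (slots 128-255)."""
    return all(b < 128 for b in data)


original = bytes([65, 163, 200])
stripped = strip_high_bit(original)
print(list(original), "->", list(stripped))
print(is_seven_bit_safe(original), is_seven_bit_safe(stripped))
```

Roman text in slots 0-127 passes through unchanged, while every upper-half character is silently replaced by a lower-half one, so a file in any 8-bit tamil encoding becomes unreadable after such a transfer.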
However, it is essential that the proper precaution is taken in the definition of the encoding of the character set. There are many 8-bit encoding standards implemented today on computers (OS). Mention can be made of the following: i) ANSI standard having characters in all of the 16 rows; ii) Macintosh (MacRoman) encoding, having characters in all of the 16 rows; iii) Windows encoding ...... and iv) Adobe Standard Encoding. It is strongly recommended that the character names chosen during the packing correspond to full ANSI standard (or that of MacRoman ??).
For smooth handling of text files created using the proposed encoding scheme, it is desirable to register the proposed standard code-set TSCII as an international ISO standard. The commonly used web browsers such as Netscape/Internet Explorer recognize and handle many types of character sets and they can be persuaded to include the TSCII code set as well. This would also avoid the necessity to choose "personal encoding" to read the tamil text files on these browsers. 
Continued in Part II, inclusive of Annexes (carrying the technical details of the proposed standard and guidelines for implementation).

This file was last revised on 2 Dec. 1997.
Please send your comments to Dr. K. Kalyanasundaram