The following is a DRAFT of a proposal for comments
by the internet tamil community. Early in 1998, this draft will
be finalised and the proposal presented to the Tamilnadu Computer Standardisation
Committee (TNC) for possible adoption as a Standard.
A Proposal for
A Tamil Standard Code For Information Interchange
(TSCII)
Tamil
Tamil is one of the two classical languages
of India. It is the only language in that country which has continued to
exist for over two thousand years. It is spoken today by approximately
65 million people living mainly in southern India, Sri Lanka, Singapore,
Malaysia, Africa, Fiji, the West Indies, Mauritius and Reunion Islands,
United Kingdom, United States and Canada. Tamil is the pre-eminent member
of the Dravidian Language family and has one of the longest unbroken literary
traditions of any living language in the world. [1]
Information Processing
in Tamil
Dravidian languages such as Tamil are written in non-roman scripts. Hence
typing text in these Indic languages on a computer requires
specific font faces, word-processing software, or both. In spite
of this limitation, word-processing of tamil text materials on computers
has been taking place for well over a decade. Many different fonts and packages
have been developed. With the availability of free tamil fonts in the internet
during the last two years, there has been a phenomenal growth in the number
of web sites dealing with matters of interest to Tamils at large. There
are already a number of tamil language newspapers and many popular, literary
magazines available "On-line" in tamil script. There are web sites devoted
to collection of electronic texts of tamil literary classics, language
learning etc. in tamil script. A comprehensive
listing of over 350 web sites of interest to Tamils is also available
on the internet [2]. Currently nearly all tamil computing is at
the word-processing level. We do not yet have dedicated software for other
applications such as databases and multimedia.
In the absence of any organised effort
to co-ordinate and promote tamil information processing at the national
and international level, many different fonts and desktop publishing packages
have been developed in different parts of the world. Hardly any
standard protocols were followed in the development of these key tools for tamil
information processing. This in turn has led to the present (rather unfortunate)
situation in which one needs to download and install several tamil fonts or
packages to be able to access most of the materials of interest to tamils
available on the internet. An International Conference devoted to Tamil
Information Processing (TamilNet'97) was organised early this
year at Singapore by the Internet Resources and Development Unit (IRDU)
of the National University of Singapore to discuss the situation and propose
possible standards. This conference was unique as this was the first one
to address computing issues in Tamil. Some of the key
papers presented at this conference [3] (including one -a
broad overview of features of different fonts and DTP packages currently
in use [4] ) are available on the internet.
Recent Efforts
Towards Standardisation
Recently there have been three series
of efforts, all directed towards standardisation of the tamil information
processing on the internet. Firstly there have been a couple of national
and international conferences on this topic, including the TamilNet'97
mentioned earlier. Secondly, the Tamil Nadu Government recently set up
an expert committee /task force ("The Tamilnadu Computer Standardisation
Committee") to examine the situation and make recommendations. This committee
has made its first recommendations
for tamil keyboard layouts [5].
Tamil Language has been fortunate to
have two major email discussion lists on internet: one operated by Asia-Pacific
Internet Company (APIC) called tamilnet
[6] (tamil@tamil.net) and one operated by IRDU unit of the National
University of Singapore called TamilWeb
(tamilnet@tamilnews.org.sg) [7]. For over a year, tamil lovers from
different parts of the world have been discussing in these email lists,
the urgent need to have an international standard for tamil computing.
Participants for these discussions come from different walks of life (software
developers, academics at the universities and ordinary/simple end-users).
Recently the mailing lists were merged into a single one. The proposal
for a new international standard for tamil information interchange discussed
in these web pages is the outcome of these deliberations (exchange of several
hundred emails amongst several hundred participants over several months!!!).
Current Standards
for Tamil
Before we elaborate on the proposed
standard, it is pertinent to review the current standards, if any. In the early
eighties, the Dept. of Electronics of the Govt. of India appointed an expert
committee to establish standards for information processing of indic languages.
The Indian Standard Code for Information Interchange (ISCII) first
launched in 1984 is the outcome of this exercise. The Indian Standard Code
ISCII is an 8-bit umbrella standard, defined in such a way that all indian
languages can be treated using one single character encoding scheme. ISCII
is a bilingual scheme that encodes characters (not glyphs!). Roman characters
and punctuation marks as defined in the standard lower-ASCII take up the
first half of the character set (first 128 slots). Characters for indic languages
are allocated to the upper slots (128-255). The Indian Standard ISCII-84
was subsequently revised in 1991 (ISCII-91). It is currently undergoing
revision, possibly leading to ISCII-97. Along with the character encoding
scheme (ISCII), the Govt. of India also defined a keyboard layout for input
called INSCRIPT. The research and development wing of the DOE, Govt
of India (called Center for
Development of Advanced Computing, CDAC based in Pune, India) has developed
software packages based on these indian standards. Multilingual and Multimedia
products are based on Graphics
and Intelligence-based Script Technology (GIST)(Email: gist@cdac.ernet.in).
Commercial DTP packages based on ISCII are also available.
UNICODE[8]
is a rapidly evolving international standard for multi-lingual word-processing.
Unicode is a more ambitious 16-bit character encoding scheme that defines
over 65000 slots for 50+ world languages. Along with other indic languages,
Tamil
has been assigned specific slots U+0B80 -> U+0BFF (which, in decimal, is
2944 -> 3071; 128 locations) in this multi-lingual standard [9]. For
obvious reasons, the choice of characters in UNICODE for indic languages
is based on the indian standard code ISCII. Microsoft has already implemented
Unicode in its Windows 95/NT OS and even distributes a
unicode font free for multi-lingual word-processing[10]. These fonts
do NOT yet include any glyphs for the indic language segments. Apple has
released recently a
multi-lingual package for indian market based on ISCII [11] but this
package does NOT include, yet, the glyphs corresponding to Tamil.
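The Tamil range quoted above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `in_tamil_block` is illustrative and not part of any standard.

```python
# Tamil block boundaries as quoted in the text above.
TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF

def in_tamil_block(ch: str) -> bool:
    """Return True if the single character ch lies inside U+0B80..U+0BFF."""
    return TAMIL_BLOCK_START <= ord(ch) <= TAMIL_BLOCK_END

block_size = TAMIL_BLOCK_END - TAMIL_BLOCK_START + 1
print(TAMIL_BLOCK_START, TAMIL_BLOCK_END, block_size)  # 2944 3071 128
print(in_tamil_block("\u0b95"), in_tamil_block("A"))   # True False
```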
Need for the
Proposed Standard for Tamil
If ISCII and UNICODE standards already
exist for information interchange of indic languages (including tamil),
a natural question is why propose another standard for tamil. Listed below
are some key arguments advanced in this context:
i) Both ISCII and UNICODE emphasise
character encoding and leave the screen rendering of these characters to
software developers. Dravidian languages are notorious for their complex
glyph structures. Practically all of the current implementations of the
ISCII and Unicode standards invoke modern font-handling techniques (such
as glyph substitution) that are available only in state-of-the-art computers
running under the latest versions of the OS (two of the most widely used
platforms - Windows and Macintosh). Consequently these DTP packages are
very expensive. A layman / simple tamil user is precluded from doing any
simple word-processing of tamil texts on earlier generation computers.
ii) The necessity to go for advanced
font handling techniques such as glyph substitution further puts us to
a disadvantage as we will have to wait for applications (DTP, Word Processing
etc.) to be developed from scratch for Tamil and we may not enjoy the luxury
of using off-the-shelf applications that were developed for English *as-is*
in Tamil.
iii) Using Devanagari script as the
reference, ISCII defines a certain encoding scheme for all indic
languages including the dravidian languages such as tamil, telugu and malayalam.
Many scholars of the dravidian languages are highly critical of this
approach. The phonology and the script usage of dravidian languages are
very different. There are many characters in Tamil and Malayalam for which
there are no equivalent devanagari ones. Compromises are made by allocating
extra slots to introduce these additional characters. By treating all indian
scripts under one scheme, ISCII philosophy does not take advantage of the
fact that Tamil *can* be encoded in a simple form that seamlessly integrates
with existing computing platforms without requiring specialised rendering
technologies.
iv) ISCII and UNICODE are NOT the
only avenues open for tamil information interchange. It is worth pointing
out that these are evolving standards. For several decades before their
emergence, information processing and exchange in the major languages
of the world went on by means of simple, self-standing
7- and 8-bit fonts. The only problem with the tamil fonts of this kind is that no
standard encoding scheme has been used. So, the exchange of tamil text
files is not simple and one needs to use converters to go from one scheme
to another. Web (read World-Wide-Web) based information exchange is fast
growing as the rapid, cost-effective means of data exchange across the
world. A standard encoding scheme for these tamil fonts can simplify the
exchange enormously. European languages, for example, have been fortunate
to have several character-encoding standards defined and universally implemented.
There are several advantages to developing
a tamil standard for information interchange that is based on simple, self-standing
fonts:
i) Once installed in the system, they
can be used directly in practically all applications without any extra
software/hardware intervention; ii) Fonts developed for
one encoding scheme can be ported easily to other computer platforms
(particularly between Windows, Macintosh and Unix).
The task is so routine and simple that a growing number of fonts are being
made available FREE on the internet, even by amateurs. iii) World-wide,
FREE distribution of a self-standing tamil font will lead to very rapid
standardisation of information interchange, as has been the case with most
of the European languages, Russian and Japanese. Until recently (when
free tamil fonts appeared on the internet), tamil word-processing required
purchase of a tamil font for at least US$50 (much higher for DTP packages).
No language can flourish in the emerging computer era unless the basic fonts
required for routine tasks either come as part of the computer system
software or are freely available.
Design Goals
of the Proposed Standard
1. Establish a consistent International
Tamil character encoding standard that will in turn lead to a self-standing
Tamil font usable on all computer platforms, particularly on earlier
models/operating systems (covering at least those that appeared within
the last decade).
-
A tamil font defined very much like
the roman font such as Times or Helvetica, once installed in the system,
can be used on all software packages supported by the respective OS without
the need for additional software/hardware intervention. It is likely that
over 90% of tamil computing is in the form of simple word-processing of
plain text. The encoding standard must be such as to be readily implemented
in most of the widely used computer platforms (UNIX, Windows and Mac).
The input of tamil materials will be in all these three platforms. On the
internet, the information exchange may involve any of the three OS (sender
could use a Windows PC, the recipient a Mac and the intermediate mail server
can be Unix-based).
-
Fortunately in the last three years,
procedures have been developed for production of fonts with an identical encoding
scheme that work under these different platforms. Information exchange
via email and WWW has also matured to the point that no serious problems are
anticipated in rapid implementation of the proposed scheme on all three
OS. Tamilnadu Govt. is willing to undertake the task of producing one such
tamil font and distribute it free on internet. Free distribution of a handful
of such fonts will not deprive the software market. There will always be
a need for specially designed fonts for professional usage (in publishing
houses), very much the same way the font market still exists for roman
fonts (Adobe and others continue to make millions marketing roman fonts!)
2. The encoding is at the 8-bit
bilingual level, using a unique set of glyphs and the usual lower ASCII
set (roman letters with standard punctuation marks) occupying the first
128 slots.
-
Why an 8-bit bilingual scheme?
i) Almost all of the European languages
(representing a population of several hundred million!) currently employ such
8-bit bilingual schemes, commonly known as the ISO 8859-X schemes. Such 8-bit
schemes are proven standards widely implemented by all major computer platforms.
So, in terms of identification and implementation, the scheme is rather
straightforward even for non-tamil speakers.
ii) An 8-bit scheme with the lower ASCII
part in the first 128 slots can enormously facilitate the smooth flow of
information across the internet in all of the commonly used protocols (SMTP,
FTP, HTTP, NNTP, POP, IMAP,..) All non-tamil speaking personnel entrusted
with communication flow (postmasters, system administrators,..) can easily
follow the content, its originator, destination etc.
iii) Tamilnadu as a constituent
state of India works under a bilingual scenario with both English and Tamil
as the languages for official communications. With a single font it will
be possible to correspond in either or both of the languages. The ISCII standard
of the Govt. of India is also defined in a similar way.
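The bilingual layout described above can be illustrated by splitting a byte string into lower-ASCII and upper-half runs. This is only a sketch: the function name `split_runs` is hypothetical, and the upper-half byte values in the sample are arbitrary placeholders rather than real slot assignments.

```python
from itertools import groupby

def split_runs(data: bytes):
    """Split an 8-bit bilingual byte string into (label, bytes) runs.

    Bytes 0-127 are labelled "roman" (lower ASCII); bytes 128-255 are
    labelled "tamil" (the upper half reserved for Tamil glyphs).
    """
    runs = []
    for is_upper, group in groupby(data, key=lambda b: b >= 128):
        runs.append(("tamil" if is_upper else "roman", bytes(group)))
    return runs

# The upper-half values below are arbitrary placeholders, not real slots.
sample = b"To: " + bytes([0xAB, 0xC4, 0xAA]) + b" <user@example.org>"
for label, chunk in split_runs(sample):
    print(label, chunk)
```

Because the roman half is untouched, headers and addresses in such a message stay readable to software and staff that know nothing about the Tamil half.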
-
What is meant by a unique set
of glyphs?
Tamil has far too many alphabets
for each to be accommodated as a single glyph in the 128 slots left. So, depending
on the complexity of the character (and its rendering) the scheme may use
one, two or three bytes to define a single alphabet. But the choices of
glyphs are such that each of the 250+ tamil alphabets (uyir, mei and uyirmeis)
is represented in one and only one way.
In the past, Tamil language used
alternative glyphs for some of the tamil alphabets (e.g. forward kombu/kokki
to write lai/Nai/nai, Raa, Naa and naa, referred to as ORNL). A unique
definition scheme implies that there is no place for these old style characters
in the encoding scheme.
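The idea of one, two or three bytes per alphabet with exactly one representation each can be sketched as a lookup table plus a uniqueness check. All slot numbers and names below are hypothetical and chosen only for illustration; the actual assignments are those of Table 1 in the annex.

```python
# Hypothetical slot numbers, for illustration only. The real assignments
# live in the proposal's Table 1, which is not reproduced here.
KA, KU = 0xAB, 0xCA                  # full glyphs: akara and ukara uyirmei
AAKARA_MOD, EKARA_MOD = 0xA1, 0xA6   # consonant modifiers

GLYPH_SEQUENCES = {
    "ka":  (KA,),                        # one byte
    "ku":  (KU,),                        # one byte
    "kaa": (KA, AAKARA_MOD),             # two bytes: glyph + right modifier
    "ke":  (EKARA_MOD, KA),              # two bytes: left modifier + glyph
    "ko":  (EKARA_MOD, KA, AAKARA_MOD),  # three bytes
}

def is_unambiguous(table) -> bool:
    """True if no two characters share the same glyph sequence."""
    sequences = list(table.values())
    return len(sequences) == len(set(sequences))

def encode(characters, table=GLYPH_SEQUENCES) -> bytes:
    """Concatenate the glyph codes for a list of character names."""
    return bytes(code for name in characters for code in table[name])

print(is_unambiguous(GLYPH_SEQUENCES))   # True
print(list(encode(["ka", "ko"])))        # [171, 166, 171, 161]
```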
What about character encoding
as in ISCII and UNICODE?
If the glyph encoding scheme is
UNAMBIGUOUS in defining the resulting character set, then it does not really
matter whether one chooses to encode glyphs or characters. Defining a unique
set of glyphs leading to a unique definition of all of the 250+ tamil characters
makes the glyph encoding scheme unambiguous. Defining glyphs also defines
the rendering part of the characters. The fact that several tamil fonts
already function successfully in the market is clear proof of the
validity of the approach and its implementation. As mentioned under (1),
the glyph encoding scheme allows the design of simple, self-standing fonts. Defining
characters alone and leaving the rendering part to the software (as in
UNICODE and ISCII) requires dedicated, expensive software. Most of the rendering
methods currently employed in these schemes require modern font handling
techniques that are available only on state-of-the-art computers. Unicode
fonts and the Apple Multi-lingual package (with Devanagari) can be used only
on the latest generation computers with Power PC chips and current OS software!!
3. The encoding scheme should be
universal in scope. That is, the standard must include all characters
that are likely to be used in everyday Tamil text interchange.
-
For centuries the tamil language has grown
with several grantha characters added on. The usage of these grantha characters
along with pure tamil ones is deep-rooted in the day-to-day usage of
tamil by the common man. Hence the inclusion of these grantha characters
becomes essential under the above criterion. Both ISCII and UNICODE recognise
this situation and have provided specific slots for a number of grantha
characters.
-
Unlike many of the tamil fonts and
software packages that leave out rarely used tamil alphabets (such as ngu,
ngU, nyu, nyuu), the present scheme ensures their presence. This has been
done so that multimedia and software for teaching tamil can display all
of the tamil alphabets without exception.
4. The encoding standard must be
UNICODE and ISCII compatible.
-
Why unicode compatibility?
The glyph choices are to be such
that, a one-to-one correspondence table between the alphabet/character
definitions under the present scheme and UNICODE / ISCII can be established.
A draft of one such mapping table is presented herewith in the annex section.
-
There are major advantages in meeting
this requirement:
i) The end-user can have a choice
in the storage format to be either the present 8-bit scheme or the unicode/iscii
format. So files can be exchanged readily between users of these different
standards without loss of integrity of the file;
ii) Secondly, the present glyph
encoding scheme can happily co-exist with the more sophisticated Unicode/ISCII
schemes and can even make way for a smooth transition to unicode at a future
date. Indian language packages for Unicode and ISCII are very expensive
and have started appearing in the market only very recently. Fool-proof
implementation of these standards is still a largely under-explored domain.
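A round trip between the 8-bit glyph encoding and Unicode can be sketched with a small correspondence table. The slot numbers are hypothetical (the real table is the one in the annex); the Unicode code points are the standard Tamil ones for KA, VOWEL SIGN AA, VOWEL SIGN U and VIRAMA. The sketch deliberately leaves out left-side modifiers, which would also need reordering.

```python
# Hypothetical 8-bit slots mapped to real Unicode Tamil code points.
TO_UNICODE = {
    0xAB: "\u0b95",        # ka
    0xCA: "\u0b95\u0bc1",  # ku = KA + VOWEL SIGN U
    0x90: "\u0b95\u0bcd",  # k (mei) = KA + VIRAMA
    0xA1: "\u0bbe",        # aakara modifier = VOWEL SIGN AA
}
FROM_UNICODE = {u: b for b, u in TO_UNICODE.items()}
MAX_LEN = max(len(u) for u in FROM_UNICODE)

def to_unicode(data: bytes) -> str:
    """Lower ASCII passes through; upper slots are looked up in the table."""
    return "".join(chr(b) if b < 128 else TO_UNICODE[b] for b in data)

def from_unicode(text: str) -> bytes:
    """Greedy longest-match conversion back to the 8-bit glyph codes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if ord(text[i]) < 128:
            out.append(ord(text[i]))
            i += 1
            continue
        for n in range(MAX_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in FROM_UNICODE:
                out.append(FROM_UNICODE[chunk])
                i += n
                break
        else:
            raise ValueError(f"no slot for character at index {i}")
    return bytes(out)

original = bytes([0xAB, 0xA1]) + b" " + bytes([0xCA, 0x90])
assert from_unicode(to_unicode(original)) == original
```

The longest-match step matters because a full glyph such as "ku" corresponds to two Unicode code points, the first of which is also a valid character on its own.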
-
What does unicode compatibility
mean in terms of glyph choices?
Both the Unicode and ISCII schemes include
a number of tamil numerals. So the present scheme needs to include these
tamil numerals. Otherwise there cannot be a one-to-one correspondence with
these forthcoming standards.
5. The standard may include a private
use area, which may be used to assign codes to characters not included.
-
The encoding scheme should leave at
least 4-5 slots free for special use by software developers. None of the
standard software written specifically for tamil will use the characters
that are placed in these slots in "search" or "sort" type routines. However,
use of these *special* characters in archives and other digital libraries
is not encouraged so as to prevent mis-interpretation of their 'values'
or 'meanings'.
-
What are possible uses for such
private use area?
Several possible usages can be
envisaged for these private slots:
a) replacement of straight quotes
by the corresponding curly quotes, as is the current default in most
Microsoft and Claris software for word-processing, graphics, etc. (vacant
slots #145-148 strongly recommended for this purpose)
b) diacritical markers for writing
transliterated/romanized form of tamil
c) old style tamil characters such
as lai/Nai/nai or Raa/Naa/naa.
d) "escape slots" through which
software developers can bring many special characters - such as those required
for recording/processing of old classical tamil texts still in palm-leaf
manuscript level.
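One way a search routine could ignore private-use slots is to delete them from both the text and the search term before comparing. The sketch below assumes, purely for illustration, that the private area is exactly slots #145-148 mentioned in (a); the other byte values are placeholders.

```python
# Assumption: the private-use area is exactly slots #145-148 from item (a).
PRIVATE_SLOTS = bytes(range(145, 149))

def strip_private(data: bytes) -> bytes:
    """Drop every byte that falls in the private-use slots."""
    return data.translate(None, PRIVATE_SLOTS)

def contains(haystack: bytes, needle: bytes) -> bool:
    """Search that treats private-use bytes as if they were absent."""
    return strip_private(needle) in strip_private(haystack)

# 147 and 148 stand in for curly quotes; 0xAB and 0xCA are placeholder glyphs.
quoted = bytes([147, 0xAB, 0xCA, 148])
print(contains(quoted, bytes([0xAB, 0xCA])))  # True
```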
6. The standard must be usable with, and
co-exist with, other existing software until Unicode compliant software
becomes available.
-
A one-to-one correspondence table between
the character definitions of the proposed standard and those of the popular
tamil fonts/DTP packages will ensure smooth transition and recovery of
all the archived tamil text materials produced to date. Conversion
software already exists that allows inter-conversion of tamil text
files prepared using different font encoding schemes. One such conversion
program based on the proposed tamil standard can be made available to
promote rapid and smooth transition to the new standard.
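When two font encodings use the same set of glyphs in different slots, conversion reduces to a byte-for-byte translation table. The sketch below assumes that simple case; the slot pairs are hypothetical, and encodings that split characters into glyphs differently would need sequence-level mapping instead.

```python
# Hypothetical pairs: slot in a legacy font -> slot in the proposed standard.
LEGACY_TO_STANDARD = {0xC0: 0xAB, 0xC1: 0xAC, 0xD5: 0xCA}

def build_table(mapping):
    """Build a 256-entry translation table; unmapped bytes stay unchanged."""
    return bytes.maketrans(bytes(mapping.keys()), bytes(mapping.values()))

FORWARD = build_table(LEGACY_TO_STANDARD)
BACKWARD = build_table({v: k for k, v in LEGACY_TO_STANDARD.items()})

def convert(data: bytes, table: bytes = FORWARD) -> bytes:
    """Re-encode a text file's bytes from one slot layout to the other."""
    return data.translate(table)

legacy_text = b"Re: " + bytes([0xC0, 0xD5])
standard_text = convert(legacy_text)
print(list(standard_text[4:]))  # [171, 202]
```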
7. The tamil standard must be in
public domain.
-
The character encoding standard will
have no restrictions on its use. It can be used freely for both commercial
and private purposes.
Practically all of the tamil fonts
and software currently in use world-wide are the recent work
of individual authors and hence are subject to copyright held by
their authors. The copyright protection is very clear for DTP
packages. But when it comes to fonts, the scope (what can be subject to
copyright and what cannot) is very hazy and protection varies from country
to country. So it is desirable to develop a true international standard
that will be labelled clearly as being in the public domain.
-
The proposed "encoding" is in public
domain - i.e. no one needs to seek permission or state credit to implement
a font based on the encoding. But the "implemented font" may or may not
be copyrighted by the developer - this is entirely at the developer's discretion.
8. The tamil encoding standard should
be such as to allow rapid implementation of many of the routine
tasks required in large databases (such as search or sort).
-
It is very likely that with the widespread
growth of a true international standard for tamil, large databases (library
catalogues, electronic telephone directories, land/property registries, inventories
of materials in departmental stores etc.) will be built based on tamil
script. Routine usage of these databases often requires search or sort routines.
The encoding scheme should be such as to allow development of software
for these without unnecessary demand for huge computer memory or processing
capacity.
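A cheap way to sort 8-bit text in alphabetical rather than numeric slot order is a per-byte rank table used as a sort key. The sketch below uses hypothetical slots for ka, nga and ca and handles single-byte glyphs only; characters built with left-side modifiers would need extra handling.

```python
# Hypothetical slots for ka, nga and ca, deliberately not in numeric order.
KA, NGA, CA = 0xB5, 0xAB, 0xC0
ALPHABETICAL = [KA, NGA, CA]  # intended dictionary order
RANK = {slot: 128 + i for i, slot in enumerate(ALPHABETICAL)}

def sort_key(word: bytes):
    """Lower ASCII keeps its own order; known Tamil slots use their rank;
    unknown upper slots sort after everything else."""
    return [b if b < 128 else RANK.get(b, 512 + b) for b in word]

words = [bytes([CA]), bytes([NGA]), bytes([KA, NGA])]
ordered = sorted(words, key=sort_key)
print([list(w) for w in ordered])  # [[181, 171], [171], [192]]
```

The table costs at most 128 entries, which is in keeping with the goal of avoiding heavy memory or processing demands.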
9. The encoding standard should be
such as to meet special requirements of various types of applications.
-
What about the special needs of
publishing houses that require high quality output? Can a glyph encoding
guarantee this?
As mentioned earlier, with 128
available slots it is not possible to keep all of the 250+ tamil characters
as single unique glyphs. If we use the frequency of occurrence of various
tamil alphabets in a typical tamil text as a guide, with the proposed glyph
choices, nearly 90% of the text will consist of tamil characters appearing
as full glyphs without invoking kerning or other font handling techniques!
-
Kerning is a routine font handling
technique now available in most of the common computer platforms/OS. The
present scheme envisages using kerning techniques to generate only two
series of uyirmeis (ikara and iikara varisai). Since the modifier attaches at the right end,
the ikara and iikara varisai uyirmeis can be rendered fairly precisely on
all platforms. So it is likely that over 98% of the tamil characters can
be rendered easily on screen and in print without loss of quality. Techniques
such as pair-wise kerning can handle even the residuals adequately.
-
Also, sophisticated software packages
for professional publishing houses (that use high end computers ) can always
invoke font/glyph substitution and bring in a single glyph for the required
character in question rather than using a composite. Provisions have been
made with escape slots for this purpose.
-
What about the display in Point-of-Sale
(POS) terminals?
Excessive use of kerning can
pose problems for character display in POS terminals. As stated above,
kerning is invoked in only two series (ikara, iikara varisai). Even
if the kokki falls apart in these cases on primitive POS systems, the tamil
text should still be readable. Since the screen is constantly re-written,
there should not be any problem in displaying all characters. Even so, on
character-only terminals certain characters cannot be rendered legibly
(Na, ha, ksha, sa may not fit in the usual 8 x 12 cell).
10. The output of the tamil standard
(tamil text) should be independent of the input mode.
-
There are several popular methods of
input for tamil and these are considered under different keyboard layouts:
classical tamil typewriter, romanized and phonetic or transliterated. Several
Keyboard editors that allow input according to these different methods
have already been developed and these can be readily adapted to include
the proposed encoding scheme as the reference chart for the font in question.
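Independence of output from input mode can be pictured as several keyboard layouts that all map onto the same glyph codes. The sketch below is illustrative only: both the slot numbers and the key assignments are hypothetical.

```python
# Hypothetical glyph slots shared by every input method.
KA, KU = 0xAB, 0xCA

# Hypothetical key assignments for two different input methods.
ROMANIZED = {"ka": bytes([KA]), "ku": bytes([KU])}
TYPEWRITER = {"h": bytes([KA]), "F": bytes([KU])}

def encode(keys, layout) -> bytes:
    """Translate a sequence of key tokens into encoded text for one layout."""
    return b"".join(layout[k] for k in keys)

from_romanized = encode(["ka", "ku"], ROMANIZED)
from_typewriter = encode(["h", "F"], TYPEWRITER)
print(from_romanized == from_typewriter)  # True
```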
11. As with the Unicode standard, "the
proposed standard does not encode idiosyncratic, personal, novel, rarely
exchanged, or private-use characters, nor does it encode logos or graphics.
Artificial entities, whose sole function is to serve transiently in the
input of text, are excluded. Graphologies unrelated to text, such as musical
and dance notations, are outside the scope of the .. standard."
-
One possibility would be to agree on
a supplementary ding-bat type font for exclusive usage amongst the tamil
community - one that contains symbols such as OM, religious symbols, arrows,
greek symbols etc. If all tamil web pages use these two fonts (one official
tamil font and a second de-facto standard dingbat style font), we can
easily add some color and liveliness to the world of tamil computing.
Proposed Standard
Code for Tamil
The proposed tamil standard is an 8-bit
bilingual scheme with the standard roman characters and punctuation marks
in the first 128 slots (as in the lower ASCII chart). The tamil glyphs along
with a handful of grantha characters and special characters are placed
in the upper-ASCII part (slot positions 128-255). Table
1 (see annex-1) presents a complete listing of the various glyph choices
along with their code assignments. The same information is also presented
in the form of a compact .gif file, Figure
1. Motivation for the specific glyph choices and the slot allocations
is elaborated in the preceding section (design goals of the proposed standard).
Glyph choices for Slot
positions 0-127 /rows 0-7:
Roman characters and punctuation marks
- glyph choices identical to those in standard lower ASCII code / 8859-1
(Latin-1) schemes
Glyph choices for Slot positions
128-255 /rows 8-15:
i) entire set of vowels (uyirs) (18)
and consonants (meis) (14)
ii) entire set of akaram-eRRiya meis
(18), ikara varisai (18) and ukara varisai (18) and uukara varisai (18)
iii) consonant-modifiers for aakara,
ikara, iikara, ekara and Ekara varisai (5); consonant-modifiers for the
ukara and uukara varisai for the grantha characters (2)
iv) grantha characters ja, sha, sa,
ha, ksha ( 5 vowel form and also the corresponding akara varisai (5))
v) special characters: copyright sign
(#169), registered sign (#174) and bullet (#183) at their respective ANSI
code positions shown within parenthesis. (Most of the punctuation marks
required for presentation of newspapers and magazines on WWW are available
in the standard lower ASCII set. Two missing ones are the copyright and
registered mark signs. We included them here at their respective ANSI code
positions to avoid the need to invoke additional font face tags in the
HTML files.)
Notes on Implementation
of 8-bit encoding scheme
What are the procedures to be followed
in design of fonts so that there is no loss of integrity of files when
they are transported across different computer platforms?
Amongst the character sets defined
for world languages, the most extensively used standards are those corresponding
to ISO 8859-X schemes, X=1,2,3,4 and 9 (these are also known as the Latin-1, Latin-2,
Latin-3, Latin-4 and Latin-5 schemes). Due to fears of bit-stripping, 8859-X
schemes do not use the rows 8 and 9. Till the early nineties, bit-stripping
was a common problem. (When bit-stripping occurs a character at slot #163
gets replaced by the one at #131 (=163 - 32).) Now most of the communication
protocols (particularly email (SMTP) and the Web (HTML on Netscape
or Internet Explorer)) fully implement 8-bit level transfers.
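The effect of a 7-bit channel on 8-bit text can be simulated as below. This is a sketch under one stated assumption: that bit-stripping clears the most significant bit of each byte, i.e. subtracts 128 from upper-half slots. The worked example in the text subtracts 32 instead, so the mechanism it has in mind may differ from this one.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel that clears bit 7 of every byte (assumed model)."""
    return bytes(b & 0x7F for b in data)

def survived_transport(sent: bytes, received: bytes) -> bool:
    """True if the text arrived byte-for-byte intact."""
    return sent == received

original = b"abc" + bytes([163, 171])
damaged = strip_high_bit(original)
print(list(damaged))                          # [97, 98, 99, 35, 43]
print(survived_transport(original, damaged))  # False
```

Lower-ASCII bytes pass through such a channel unchanged, which is why only the Tamil half of a bilingual text is at risk.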
However, it is essential that the
proper precaution is taken in the definition of the encoding of the character
set. There are many 8-bit encoding standards implemented today on computers
(OS). Mention can be made of the following: i) ANSI
standard having characters in all of the 16 rows; ii) Macintosh
(MacRoman) encoding, having characters in all of the 16 rows; iii)
Windows encoding ...... and iv) Adobe Standard Encoding. It is strongly
recommended that the character names chosen during the packing correspond
to full ANSI standard (or that of MacRoman ??).
For smooth handling of text files
created using the proposed encoding scheme, it is desirable to register
the proposed standard code-set TSCII as an international ISO standard.
The commonly used web browsers such as Netscape/Internet Explorer recognize
and handle many types of character sets and they can be persuaded to include
the TSCII code set as well. This would also avoid the necessity to choose
"personal encoding" to read the tamil text files on these browsers.
Continued in Part
II, inclusive of Annexes (carrying the technical details on the proposed
standard and guidelines on implementation).
This file was last revised on 2
Dec. 1997.
Please send your comments to Dr.
K. Kalyanasundaram