A Proposal for a True Internationalization
by Mathias Hartwig, in Editorials - Sat, Jun 15th 2002 00:00 UTC
7bit character streams are the most secure against
misinterpretation. When I send email, I leave everything else out
(though we Germans have other symbols, too), just in case there is a
machine that cannot handle it. It has become a whole lifestyle; to
write everything in lower-case letters and put a smiley on the end of
the line seems to express global thinking. But if we are honest, we
know it expresses nothing but a deficiency in modern character
processing. This heritage of the 70s (the start of Unix system
distribution) is a hard hurdle to overcome. Fortunately, the need for
usability is getting stronger and the pride of programmers and
administrators is getting weaker. As a student of the Japanese
language, I went through many sleepless nights setting up user
variables, input parsers, and terminal settings, so I think I know the
difficulties. With this article, I offer a proposal that comes from
both sides of me, the programmer and the user.
Copyright notice: All reader-contributed material on freshmeat.net
is the property and responsibility of its author; for reprint rights, please contact the author
directly.
Today, it is quite common that people from different countries are
working together. Also, though every employee may have his or her own
terminal, it's likely that common applications are provided by a
single file server. This creates a need for multi-language operating
systems. It sounds critical, but it isn't (yet). The commands are
exactly the same (Chinese sysadmins also type "mount", though I can
imagine language-dependent symlinks) and answers may vary only a
little (e.g., "y/n" in German is "j/n"). Until now, all other tasks of
internationalization have been avoided by the system, and applications
have to take up the slack. If they don't (if, for example, the
administrator forgot to install the appropriate .po files), you can be
lost on a terminal with pictograms that, to you, mean nothing.
The i18n movement which started some years ago solves a lot, but not
everything. With it, only output is guaranteed to match the best
translation gettext can find. What about the input? Multibyte strings,
produced by input parsers like kinput2 or ami in an 8bit or 7bit
environment, are hard to handle and break easily (if you press the
delete key, it removes only half a sign). kinput2 and ami cannot run
together in one terminal, because their code pages intersect. Start and
end sequences are one solution, but a bad one, and certainly not one
meant for the long run. Imagine a document full of different languages;
if I want a function that gives the line length for this document, it
will be hell, and I haven't even mentioned what will happen when new
languages with new start and end sequences are implemented.
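The half-a-sign problem is easy to reproduce. The following is a minimal Python sketch, not code from kinput2 or ami; EUC-JP is picked here only as one example of a multibyte encoding, and the helper name delete_last_byte is made up for illustration.

```python
# Illustrative only: EUC-JP stands in for "a multibyte encoding".
text = "\u65e5\u672c\u8a9e"          # three Japanese characters
raw = text.encode("euc_jp")          # each of them takes two bytes in EUC-JP

char_count = len(text)               # length in characters
byte_count = len(raw)                # length in bytes


def delete_last_byte(data: bytes) -> bytes:
    """What a byte-oriented delete key does: drop one byte, not one character."""
    return data[:-1]


broken = delete_last_byte(raw)

try:
    broken.decode("euc_jp")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False          # the last character was cut in half

print(char_count, byte_count, decodes_cleanly)
```

A line-length function that counts bytes would report 6 here, while the reader sees 3 signs; any routine that edits the byte string without knowing the encoding can leave it undecodable.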
Also, we have so many applications which handle text and
formatting. Integrating multiple language parsers into them may
take five times as long as implementing the problem-specific algorithms.
I think something like Microsoft's IME, a central (system-wide)
solution, is needed here. Unfortunately, IME is not Open Source, and
is therefore un(sup)portable.
Next problem: Character encoding. Oops, this discussion is as old as
computers are. Every nation had its own coding scheme, using the same
domains! What a crappy idea! How could somebody let this happen?! OK,
you say we have Unicode. Unicode was a good idea, until they found
that 16bit is too little. Also, look at Yudit's encoding list; there's
not one single Unicode encoding, but many: UTF-7, UTF-8, UTF-16, etc.
Furthermore, Unicode text files may begin with a starting sequence
(the byte order mark), and Windows saves Unicode with low-hi byte
order, but Posix systems don't. Java uses wide characters (16bit)
internally, and now that means nothing: 16bit is just too little; it
was only for the short run.
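To make the "many Unicodes" complaint concrete, here is a small Python sketch that encodes one German word several ways. It uses only the standard codecs; the sample word and variable names are chosen for illustration and do not come from Yudit or any other tool named above.

```python
import codecs

text = "Gr\u00f6\u00dfe"              # "Groesse" with o-umlaut and sharp s: 5 characters

utf8 = text.encode("utf-8")           # variable width: 1 or 2 bytes per character here
utf16_le = text.encode("utf-16-le")   # low byte first, no starting sequence
utf16_be = text.encode("utf-16-be")   # high byte first, no starting sequence

# The "starting sequence" of a Unicode text file is the byte order mark.
with_bom_le = codecs.BOM_UTF16_LE + utf16_le

sizes = {"utf-8": len(utf8), "utf-16": len(utf16_le)}
print(sizes, with_bom_le[:2])
```

The same five characters come out as 7 bytes in one encoding and 10 in another, and the two UTF-16 variants differ only in the order of each byte pair, which is exactly what the byte order mark is there to announce.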
Next problem: The console. Its fonts and behavior differ from those of
X (which, ten years after the invention of TrueType fonts, still lacks
correct handling; take a look at Abiword, and you will understand).
If I were Chinese, I would want to also see Chinese on my console, but
this is even harder than under X, not to mention input routines. But
what's the difference for an input parser between X and the console?
Next problem: Somebody better stop me from complaining. We have to
move on. We still use the old stuff, but are now saving in XML. This
is not very revolutionary. I will try to take a step forward. I'd
like to present a solution. It's time to think about an all-inclusive,
simple, and working system design.
But first, again, a collection of the problems mentioned above:
- Machines are 7bit- or 8bit-oriented, and the input is, too.
This is historical, but we have to overcome the compatibility
paranoia, or there will be no progress.
- Code pages intersect, and we have to improve the Unicode scheme.
- We want input parsers on the operating system level,
at a central place, which serve both X and the console.
Also, a basic font that holds all characters for both X and
the console is wanted (I don't like those question marks).
- We want neither start nor end sequences.
- We want user realtime language switching for input (and
maybe for output, too).
Fortunately, there are now these advantages which we can use:
- We process addresses and integers as 32bit, we have 32bit
buses, and next generation computers will have full 64bit
architectures.
- Applications do not contact keyboard codes; they already
get the values through the system (nearly unprocessed,
but at least a keymap is used).
- Computers are quicker than ever, giving us enough time to parse
the input correctly.
- So many routines for combining characters have already been
written.
- We have enough memory for the BFF (Big Fucking Font).
- We have enough hard drive space for the text files.
Especially when I think about points 5 and 6 of the advantages, I say:
Why bother? Let's give it a try. What I propose is first a new char
type that is 32 bits wide. This will give us the security that in the
future, no characters of any language will be shut out. The most
low-level routines (those that write to the buses) will have to be
changed. Upper-level APIs may stay the same (as user programs do), as
long as they do not rely on 8bit overflow arithmetic (255 + 2 = 1).
And, for heaven's sake, I propose to use only 7 bits of each of the 4
bytes. That still leaves 28 bits, or around 270 million signs. You
might say "That's way too much; 3 bytes, like for my display, is
enough!" Well, there are sound cards that process 24bit, but the
processor has to pack it into 32bit packages to enhance speed, so in
the end, there's no real advantage to 24bit. Also, in another 100
years, there might be the need for more. Please throw away the idea
that you will see 4 bytes when you open a terminal! A char (you could
call it sign or foo or bar if you like) will be an atomic piece of
data. This view also fits in the modern multimedia processing arena,
where sound data consists of 2-byte 2-channel or, for studio work,
24/32-bit multi-channel data structures. Binaries consist of 32bit
words, as do video streams, the most complex data we know today. Only
the text file has kept us from taking this revolutionary step.
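The arithmetic behind "around 270 million" can be checked with a short Python sketch. The article does not say how the 7bit groups are laid out, so the layout below (highest group first, top bit of every byte left at zero) and the names pack and unpack are assumptions made for illustration.

```python
BITS_USED_PER_BYTE = 7
BYTES_PER_CHAR = 4

# 7 bits in each of 4 bytes gives 28 usable bits.
capacity = 2 ** (BITS_USED_PER_BYTE * BYTES_PER_CHAR)


def pack(code: int) -> bytes:
    """Spread a character number over 4 bytes, 7 bits each, high group first."""
    if not 0 <= code < capacity:
        raise ValueError("code does not fit in 28 bits")
    return bytes((code >> shift) & 0x7F for shift in (21, 14, 7, 0))


def unpack(data: bytes) -> int:
    """Inverse of pack(); rejects input that is not 7bit clean."""
    if len(data) != BYTES_PER_CHAR or any(b & 0x80 for b in data):
        raise ValueError("not a 4-byte, 7bit-clean character")
    code = 0
    for b in data:
        code = (code << 7) | b
    return code


print(capacity, pack(ord("A")), unpack(pack(0x65E5)))
```

Every byte produced this way stays below 128, so such a stream would pass through 7bit-only hardware unchanged, which appears to be the motivation for giving up the eighth bit.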
If you think this will blow up your filesystem, you are most likely
wrong. Take the sizes of your text files, multiply them by 4 (or 2 if
you are using CJK encoding text files), and compare them with your
wav, MP3, or DivX files' sizes. Those files will not get bigger. The
7bit style is for old Internet routing hardware, but I think that in
another 100 years, it won't still be there. Then, the remaining bit
of each byte may also be used. The encoding scheme is simple: a
character is saved and loaded exactly as its number. No conversion.
Hi-low byte order is preferred and seems more logical. We could use
what Unicode did for 16bit,
seamlessly integrating domains, allowing enough room between language
domains. Unlike UTF-8, we don't want different sizes for Western and
Eastern characters; that makes programmers unhappy and software
difficult to control. Also, UTF-8 emphasizes historical Western
domination of computing science, which is not very friendly. No start
and end sequences -- that's it.
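A sketch of "saved and loaded exactly as its number", with the size factors from the paragraph above. The article proposes its own numbering; this Python example borrows Unicode code points as the numbers purely as a stand-in, and save and load are hypothetical names.

```python
import struct


def save(chars: str) -> bytes:
    """Write each character as its number: 4 bytes, high byte first."""
    return b"".join(struct.pack(">I", ord(c)) for c in chars)


def load(data: bytes) -> str:
    """Read 4 bytes at a time and turn each number back into a character."""
    return "".join(chr(n) for (n,) in struct.iter_unpack(">I", data))


western = "Hello, world"
cjk = "\u65e5\u672c"                  # two Japanese characters

# Growth compared with the encodings in common use today.
western_factor = len(save(western)) / len(western.encode("ascii"))
cjk_factor = len(save(cjk)) / len(cjk.encode("euc_jp"))

print(western_factor, cjk_factor)
```

With fixed-width storage, the length of a line in characters is simply the byte count divided by 4, and no start or end sequences are needed to tell the languages apart.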
Let's go on. There will still be a mapping between keyboard-sent codes
and the 32bit chars attached to them, as a phase of preprocessing. The
next step will be checking the user's input choice and sending the
data to the parser. This parser will build buffers for input, syllable
buffers, chosen readings, etc. The buffers will belong to the
system. They will be cleared when switching between parsers, but we
need to be able to see what we are typing before it is final. Under X,
window managers may have a small buffer dock app in which you could see
the language symbol of what you're typing. Take a look at IME, and
you'll know what I mean. It might be more difficult in the console, but
with libs like ncurses, there might be a way to give a better view of
the writing. Also, shells might stop character echoing and write
buffer contents instead, then clear back to the last breakpoint, write
the new sign, change buffers, and write them again after the new sign.
I did this once with a small Japanese console learning app, and it's
fast enough that you don't see it. When the enter key is pressed, I
propose that only ready-typed input be used; otherwise, some shells
will do so and some will not, and there will be no standard behavior.
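The buffer handling described above might look roughly like the following Python toy. It is not how IME or kinput2 work internally; the class name, the three-entry romaji table, and the rule that Enter discards unfinished input are all assumptions made to illustrate the idea.

```python
# Hypothetical romaji table; a real parser would carry a far larger one.
ROMAJI_TO_KANA = {"ka": "\u304b", "ki": "\u304d", "a": "\u3042"}


class InputParser:
    """Toy parser: keys collect in a pending buffer until they form a syllable."""

    def __init__(self, table):
        self.table = table
        self.pending = ""      # keys typed so far that are not yet a syllable
        self.committed = ""    # ready-typed text

    def feed(self, key: str) -> None:
        """Take one preprocessed key and commit a sign once a syllable is complete."""
        self.pending += key
        if self.pending in self.table:
            self.committed += self.table[self.pending]
            self.pending = ""

    def enter(self) -> str:
        """On Enter, hand over only ready-typed input and clear both buffers."""
        result = self.committed
        self.committed = ""
        self.pending = ""
        return result


parser = InputParser(ROMAJI_TO_KANA)
for key in "kak":
    parser.feed(key)
print(parser.committed, parser.pending)
```

A shell or dock app would redraw from committed and pending after every key; because the parser lives in one place, X and the console could share it.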
Now I will think about one of the most-feared things in the computer
world: The change of data structures and backward
compatibility. First, a single machine holds its data -- text files,
binaries, etc. It is most likely connected to other machines or the
Internet, and that's where I begin. These new generation operating
systems that fully process 32bit data from hard disk to whatever will
be compiled and booted on machines, but will get data through FTP or
other services that will get tons of chars of the old type. Their
buffer (char[]) will be a 32bit[] array, provided by the system
(because sizeof(char) returns 4!). The service now writes it to disk,
but, fortunately, the system does it for us, because the system always
has a fear that some applications might damage the hardware. For the
newly-installed machine, there is no difference in behavior except the
file sizes for texts. When the new machine provides services, there
will be the problem of sending too much information. If the client
wants the next byte (or char), there might be a problem if it gets a
value above 255 (or 127 signed) and dumps an error message or
disconnects.
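Both directions of the compatibility problem can be sketched in a few lines of Python. Plain lists of integers stand in for the proposed 32bit char arrays, and widen and narrow are hypothetical names, not functions of any existing system.

```python
def widen(old_data: bytes) -> list:
    """Old 8bit chars arriving from a service, stored one per (wide) char cell."""
    return list(old_data)


def narrow(chars: list) -> bytes:
    """Send to an old 8bit client; refuse values that a single byte cannot hold."""
    too_big = [c for c in chars if c > 255]
    if too_big:
        raise ValueError(f"{len(too_big)} character(s) do not fit in one byte")
    return bytes(chars)


incoming = widen(b"abc")              # receiving old data is harmless
outgoing = narrow(incoming)           # sending it back unchanged works too

try:
    narrow([97, 0x65E5])              # one wide character among old ones
    old_client_ok = True
except ValueError:
    old_client_ok = False

print(incoming, outgoing, old_client_ok)
```

Receiving old data never fails, because every 8bit value has room in a wider cell; the trouble is confined to the sending side, where a new machine has to detect that the peer is an old one.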
In the end, there will be no progress without backward
compatibility problems. Network connections are a big advantage here.
We should try to use them and finally throw away our fear, because the
gain will be a clear character processing solution that works
world-wide, with no more hassles with encoding schemes and browser
displaying problems, and a user-friendly, simple-to-use, speedy and
secure multi-language interface. Also, encoding information can be
left out, which cleans up email and XML files. The first big step will
be to change low-level system routines, and for that I wish us all
some more courage towards a change of thinking.
Author's bio:
Mathias Hartwig
was born in Leipzig, Germany. He has studied Japanese and Computing
Science, and a bit of Korean and French. He works as a programmer at
a University Hospital. He started programming with GW-Basic and Turbo
Pascal 5.0 when he was 12, and later used Delphi, Java, C++, and C,
which he still finds to be the best. His first Linux distribution was
Slackware 3.0, and he remembers his first two hours looking at the
console, typing commands like "DIR", with nothing happening. He now
uses Mandrake Linux both for work and private use, and starts Windows
only to recompile his projects with MinGW's GCC. Most of his projects
deal with language learning.
Comments
[»]
In the middle of secrecy…
by drg001 - Sep 13th 2002 01:34:13
In the middle of secrecy…
(Sometimes, a smile can do more …)
Warning:
You can read further only if you can keep this information top secret from
everybody, including your friends, and, especially from nobody. Do not copy
this document and, God forbid, do not distribute it through e-mail, regular
mail, by word of mouth or psychic ability, even if you do not have one.
http://www.tupbiosystems.com/articles/secret.html
[»]
Japanese and input methods
by Ambrose - Aug 30th 2002 00:18:03
The author should get an update on existing projects (software and
otherwise): mlterm (mlterm.sourceforge.net) can already use more than 1
input method. You can change the input method on the fly, and it will do
charset translation for you (so you can, e.g., use a Japanese input method
to type Chinese words, or vice versa, if Unicode thinks they are the same).
If you like to be confused, you can also change the charset on the fly too.
And yes, there is such a thing as a Chinese console.
As for the future direction on the interoperability of input methods, work
is already started on the implementation of IIIMF, the next-generation
input method framework that takes even Microsoft Windows into account.
Of course, the reality now is that most programs cannot use more than 1
input method. If waiting for IIIMF is not realistic (and it is not
realistic in the short or the medium term), instead of lamenting
the reality of the largeness of Unicode and the CJK charsets, we should try
to hack on the problem and try to tackle making programs able to use more
than 1 input method. Perhaps it would be possible to create "input
method proxies" that can call other input methods and translate the
input to the target locale; the existence of mlterm shows that it is
possible for a program to call more than one input method, and I see no
reason why that program cannot be itself an XIM server.
(In fact, because of mlterm, I am trying to look for XIM servers for other
writing systems such as Eastern/Western European, Greek, Russian, etc. But
I can't find them. Perhaps only the people who frequently have to use them
know where these things are.)
We Chinese used to think that our language is not "scientific",
and this caused our writing system to be despised and sabotaged by our own
people. (I admire the Arabs and the Jews, that they can keep their writing
systems r-to-l despite Western influence.) However, this is wrong because
we can use input methods to enter Chinese, and the speed of Chinese touch
typing is comparable to English touch typing. It is not productive to say a
language or even a writing system is unscientific; even among alphabetic
writing systems, the only truly scientific alphabet is Korean hangeul (if
you don't believe this, dig through sci.lang archives).
[»]
a whole lifestyle
by barrett9h - Aug 12th 2002 17:42:25
It has become a whole lifestyle; to write everything in lower-case
letters and put a smiley on the end of the line seems to express global
thinking.
indeed. =)
But if we are honest, we know it expresses nothing but a deficiency in
modern character processing.
nope. i don't feel like that.
it's really a differentiated form of comunication.
for ex, i speak portuguese, and we have many accented characters. but i
don't use any, not just because it's incompatible with some systems, but
also because it's harder to type. (you see, i don't even use a capital 'i'
for the 'i' word..)
i just use some adaptation, when apropriated.. (for ex. im portuguese "e"
and "é" are two different words, and i type them as "e" and "eh".
everybody understands..) and don't care about being gramatically correct
anyway.
the important thing is to comunicate, and i type much like the way i
talk..
i also find text mode smileys way nicer then the graphicals ones..
[»]
Re: a whole lifestyle
by Ambrose - Aug 30th 2002 00:23:40
> i also find text mode smileys way nicer then the graphicals ones..
yes, text mode smileys are way nicer than graphical ones.
The graphical smileys are *ugly*. And they interfere with the
normal (Chinese/Japanese) smileys I use. I almost always
turn graphical smileys off if a program "supports" them.
[»]
Re: a whole lifestyle
by Hisham Muhammad - Sep 3rd 2002 17:07:57
> i just use some adaptation, when
> apropriated.. (for ex. im portuguese "e"
> and "é" are two different words,
> and i type them as "e" and "eh".
> everybody understands..) and don't care
> about being gramatically correct
> anyway.
Man, this is ugly. This is just like using
"naum" instead of "não". It is, like somebody
else said in this thread, a lack of self-respect
towards one's own language.
> the important thing is to comunicate,
> and i type much like the way i talk..
Yes, this is happening more and more. And it
is very unfortunate. The quality of written text is
decreasing. A friend of mine even wrote "aí eu
peguei e fui lá"... :(
[»]
Re: a whole lifestyle
by barrett9h - Sep 4th 2002 11:14:15
That's true, but I only use this kind of writing when I'm "talking" to my
friends on email, irc, icq.
When writing a letter or something more important I like to write the
"right" way, and I still know how to write correctly when I need to... :)
Language is always evolving through history, and this (net talk) is
just one more adaptation.
[»]
Your probably right but I think->
by Butuque - Aug 9th 2002 00:00:14
I can post this with all the crap already on here...
> The i18n movement which started some years ago solves a lot, but not
> everything. With it, only output is guaranteed to match the best
> gettext will find. What about the input? Multibyte strings, produced
> by input parsers like kinput2 or ami in an 8bit or 7bit environment,
> are hard to handle and crack easily (if you press the delete button,
> it removes only half a sign). kinput2 and ami cannot run together in
> one terminal, because code pages intersect. Start and end sequences
> are one solution, but a bad one and one especially not meant for the
> long run.
>
> Imagine a document full of different languages; if I want a
> function that gives a line length for this doc, it will be the hell,
I say that this is too bad. But if you do make one, make one that I can
easily store like this.
Byte order be damned - everyone has to do the work.
>and I haven't even mentioned what will happen when new languages
> with new start and end sequences are implemented.
I don't see what will happen.
> Also, we have so many applications which handle text and formatting.
> Integration of multiple language parsers into them may take 5 times
> more than implementing the problem-specific algorithms. I think
> something like Microsoft's IME, a central (system-wide) solution, is
> needed here. Unfortunately, IME is not Open Source, and is therefore
> un(sup)portable.
Are we supposed to look at what is out there?
What does IME do?
I don't have the time to check this out, but:
Lets look at it like this::
I want to keep as much of my text file in memory at once as I can. I
am writing a spell checker and it can't take up my whole machine. I
want to be able to look up Western words and Eastern words. They are
in my database and I want that in memory as much as I can fit.
: I want to store as Multi Byte Characters (like MBCS only 64 bit).
: Most of my data can be compressed because I am a Pentium 4 on a
programmer's desk and have CPU cycles to burn.
: I need to be able to traverse this with a pointer.
: I am going to traverse it only once anyway, no need to expand it.
: My data is in records (equiv. to LineFeed, '\0' at end of string,
etc.)
C++ will do the job, but I have to use functions on my pointer class.
I want to write this code but once, and be able to read this stuff
forwards and backwards in any programming language.
Sounds like a CORBA implementation of an interface called the
characterPtr is in order that everyone will use.
Things I need:
I use things like / and \ and + and ~ that everyone with a computer
likes to use.
These would be nice to fit in with 1 byte characters so I can say in
my compression that: the next 116 characters are going to be single
byte characters.
These would also be nice to fit in with 2 byte data of the next set of
115 characters.
My program wants to recognize this character, so I will call it:
1111 1111 1111 11XX (substitute the ascii code for '/' for XX) (8
bytes)
it can be stored as:
XX
11XX
XX11
1111XX
XX1111
And, I already know how many bytes long it is, and the Motorola or Intel
storage format (my 'compression' told me this).
When I look at it using my characterPtr class, it always looks like
1111 1111 1111 11XX.
It is always kept stored compressed. I have to traverse it.
Problem: I don't use CORBA, or I can't because I program in bash.
Solution: write the equivalent of the characterPtr in your language
to access the whole world of the newly defined character set that is
always, wherever it is stored or transmitted, compressed using the
same compression scheme:
If I have data that is 100 strings, each an average length of 16
characters, and all characters are 7 bit, each string takes up about
20 bytes when encoded. The extra 4 characters tell me the length of
the string and that it is all single byte and stored in Intel format
(doesn't matter for single byte, but I know anyway).
Everyone in the world knows how to read those extra 4 characters and
so to them (us) this data looks like sets of 64 bit characters.
Solution 2: Write a String class that uses the base of this pointer
so that it can traverse the compression backwards.
The computer can no longer define the language, we must do that part.
-- //
// Butuque
//
[»]
Re: Your probably right but I think->
by Butuque - Aug 9th 2002 02:47:12
Well,
I wish I hadn't posted that. I still have to consider adding to the
compressed data,
etc.
Let's just forget I posted at all, shall we?
;-) Butuque!
-- //
// Butuque
//
[»]
please, jeff
by Robert Trebula - Jun 26th 2002 08:58:10
please, wipe out the whole comments board, ban access to fm-comments to
this idiot and let's start the discussion again.
it's causing me a headache reading these outputs of some brain-dead
child...
thanks...
Robert
PS: read it through and you'll soon find out who I am talking about
[»]
Re: please, jeff
by jeff covey - Jun 26th 2002 10:14:58
Trolls only stick around as long as they're fed.
-- vs lbh pna ernq guvf, lbh'er n trrx.
[»]
What happened to rxvt?
by linuxknight - Jun 24th 2002 18:32:18
If you go to rxvt.org, you notice that their web site hasn't been
updated since 2000, and, they say the last release was 2.7.3.
But, if you go to ftp.rxvt.org, you see rxvt 2.7.8 was released in
2001. And, if you go to their cvs on sourceforge, there were
updates until 2 months ago. That's weird. There was no
indication on their web site that the project was dead, but at
first I thought: maybe the project died without notice. But, when
I saw the ftp site had been updated until 2001, that no longer
made sense. What happened to rxvt?
-- Signed, Linuxknight
[»]
Re: What happened to rxvt?
by Ambrose - Aug 30th 2002 00:26:53
The official web site for rxvt is at rxvt.org. Please look for updates
there first.
[»]
ASCII
by linuxknight - Jun 20th 2002 19:01:00
Come on, why don't you admit that ASCII has its
uses? I know we need Unicode to allow people who
speak non-roman languages to speak in those
languages, but ASCII is good for the people who
don't speak those languages. I will admit ASCII
is not all we need, but I do think it has uses.
Like, if you use english for your documents, ASCII
contains the characters you need. But it does
include german characters. I saw a web page with
german characters in lynx. I pasted the characters
into vi. It worked. All the chars were still
there. All of them. If german characters appear in
lynx, wouldn't that prove that german characters
were supported by ASCII? Not only that, but I
saved that document. Reopened it. All the german
chars were still there. If you make a new
standard, that's fine. Just make sure you keep
it compatible with text mode. So that people can
still use lynx.
-- Signed, Linuxknight
[»]
Re: ASCII
by Daniel - Jun 20th 2002 19:54:50
locale -k charmap
I pretty much doubt the output is charmap="ANSI_X3.4-1968" (that
would be pure ASCII).
[»]
Re: ASCII
by linuxknight - Jun 20th 2002 22:41:20
>
> locale -k charmap
>
> I pretty much doubt the output is
> charmap="ANSI_X3.4-1968" (that
> would be pure ASCII).
>
>
The output was:
charmap="ANSI_X3.4-1968"
The two german characters were:
Wait a minute! When I typed out
the cat command on the console to open the file that had
the german chars, they displayed on the console, but when I
tried to paste them into this message, it only showed up as
two question marks! But, when I pasted it into another
xterm running vi, the chars appeared. And, when I opened
the file using Nedit, the chars appeared, and I was able
to paste them into this message, but when I tried to paste it
from Nedit to the Xterm, they appeared as two question
marks! It seems that the german chars can only be pasted
from one Xterm to another, or from one X application to
another. But, the german chars showed up in the file and lynx
outside X. What happened?
-- Signed, Linuxknight
[»]
Re: ASCII
by Daniel - Jun 20th 2002 23:03:29
> The output was:
> charmap="ANSI_X3.4-1968"
You're running in the C locale. It only worked because the characters
weren't checked for correctness and sent directly to the console, which
probably understands ISO 8859-1 by default.
> But, the german chars showed up
> in the file and lynx outside X. What happend?
That's exactly the kind of problem we're trying to fix -- I don't know
what happened. This whole charset nonsense is way too complicated.
LC_ALL=en_US might help you, dunno.
[»]
Re: ASCII
by linuxknight - Jun 20th 2002 23:36:56
>
>
> % The output was:
> % charmap="ANSI_X3.4-1968"
>
>
> You're running in the C locale. It only
> worked because the characters weren't
> checked for correctness and sent
> directly to the console, which probably
> understands ISO 8859-1 by default.
>
>
> % But, the german chars showed up
> % in the file and lynx outside X. What
> happend?
>
>
> That's exactly the kind of problem we're
> trying to fix -- I don't know what
> happened. This whole charset nonsense is
> way too complicated. LC_ALL=en_US might
> help you, dunno.
>
>
I'm sorry, I REALLY thought german characters were
supported by ASCII. What is ISO-8859-1? Is it a subset
of ASCII that supports german chars? But, I was sure
ASCII supported german chars, I saw german chars
on the console. So, the console can
display chars outside ASCII? Well, I thought the console
could display only chars in ASCII. My console can display
chars from all european languages, but not from any
non-roman languages. So, we don't need ASCII to keep
console mode? Since consoles can't display any japanese
chars, what should we do? get rid of the console and only
use X? But, you can't get into X without the console. That
means we wouldn't be able to use any OS based on a
console anymore. Does this mean linux is dead? Are we
doomed to use Windows XP? If you include japanese chars
in a new standard, the console is dead, and linux is dead.
But, if you exclude non-european characters from a new
standard, even if it were 8 bytes per char, linux & the console
will still work. In other words, we need a new standard, but
we need to keep it compatible with the existing consoles.
-- Signed, Linuxknight
[»]
Re: ASCII
by Daniel - Jun 20th 2002 23:50:27
> supported by ASCII. What is ISO-8859-1?
> Is it a subset of ASCII that supports german chars?
ASCII is a subset of ISO 8859-1, which is the most commonly used 8bit
extension character set in Europe. But don't think it covers all European
languages.
> In other words, we need a new standard, but
> we need to keep it compatable with the
> existing consoles.
Believe it or not, I'm running UTF-8 in my console:
locale -k charmap
charmap="UTF-8"
[»]
Re: ASCII
by linuxknight - Jun 21st 2002 00:25:26
> Believe it or not, I'm running UTF-8 in
> my console:
>
> locale -k charmap
> charmap="UTF-8"
>
>
UTF-8 is a good idea! But, how did you change your
console to use UTF-8?
-- Signed, Linuxknight
[»]
ASCII
by linuxknight - Jun 19th 2002 01:54:26
Well, i've still got one point: ASCII is capable of supporting
the roman alphabet. For example, I went to a german web site,
found some german characters, pasted it into an xterm running
vi, saved it. closed vi. opened vi with the file. The german
characters were still there, ü for example, was still there. So, if
ASCII doesn't support that character, that would be
impossible. So, ASCII must support that character, right?
Even though your point about people who speak some
language like arabic, japanese, etc. and a roman language
may be true, but, there are a lot of people who speak only
roman-based languages. So, people who only speak roman
based languages should not have to make their files 4 times
bigger for the others, should they? So, UCS-4, etc. already
exist. So, if you plan to speak non-roman languages, use it.
I dislike non-roman languages.
-- Signed, Linuxknight
[»]
Re: ASCII
by Mirza Hadzic - Jun 19th 2002 05:21:19
You were just lucky. ASCII chars are 0-127 and the "u umlaut" you
mentioned is number 129 in many codepages. So, it happens that in several
codepages it is ü, but in some countries it can be anything else, depending
on the codepage used. A codepage defines how these chars (128-255) are
interpreted. In the Czech Republic we have many codepages, among others the old
"Soviet" style KOI codepage, Kamenicky (also obsolete), Latin-2, ISO,
Windows-1250 to name a few :-). So we even have czech -> czech text file
conversion programs.
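For anyone who wants to see this on their own machine, here is a small Python sketch that reads the single byte 129 under a few codepages. The four codecs are just examples that ship with Python; they are not the full list of codepages mentioned above.

```python
byte = bytes([129])                   # the value discussed above

# The same byte, read under four different codepages.
interpretations = {
    name: byte.decode(name)
    for name in ("cp437", "cp850", "koi8_r", "latin_1")
}

try:
    byte.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False                  # ASCII stops at 127

for name, char in interpretations.items():
    print(name, hex(ord(char)))
```

Under the two DOS codepages the byte is ü, under KOI8-R it is a line-drawing character, and under ISO 8859-1 it is a control code; in ISO 8859-1 the ü sits at 252 instead.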
[»]
Re: ASCII
by linuxknight - Jun 19th 2002 16:23:50
> You was just lucky enough. ASCII chars
> are 0-127 and "Umlaute u" you mentioned
> is naumber 129 in many codepages. So, it
> happens that in several codepages it is
> u but in some countries it can be
> anything else depending of codepage
> used. Codepage is the way how these
> chars (128-255) are interpreted. In
> Czech Republic we have many codepages,
> among others old "Soviet" style KOI
> codepage, Kamenicky (also obsolete),
> Latin-2, ISO, Windows-1250 to name a few
> :-). So we even have czech -> czech text
> file conversion programs.
Is Czech a european language? Well, anyway, I included
support for every codepage in my kernel. Isn't there a
codepage that supports all of the western alphabet?
I think we have space in ASCII to add all of the extra
characters in european languages. And, why would ASCII be
127 chars? Those extra chars in german, french, spanish, etc.
that were not included in normal ASCII could be added to the
128-255 gap. I think they would fit. Most of the chars used in
european languages are already in ASCII, so all we have to
do is add the extra chars from european languages. That
would be possible.
P.S Did you know that the german word for cat is katze?
-- Signed, Linuxknight
[»]
Re: ASCII
by Mirza Hadzic - Jun 19th 2002 16:57:13
% Is Czech a european language?
>
???
> codepage that supports all of the
> western alphabet?
It is hard to say what a "Western" alphabet is. The Czech alphabet is as western
as English; so are Swedish, Hungarian, Polish, French, Turkish... even
Bulgarian or Greek, which have a totally different layout of letters. So it is
much more than 256 chars, and that's a problem. You cannot skip any language
because there are institutions like the EU which consider all these languages
equal and want to use several languages inside a single document.
[reply]
[top]
[»]
Re: ASCII
by Axxackall - Aug 4th 2002 21:24:45
Don't divide the world like Hitler did. ASCII is for nazi-thinking people.
I support UTF-8 - that's the future for a world where half of the people
speak languages which are not compatible with ASCII.
There is still another and much bigger problem yet to solve: timezones.
By the way, does anyone know anything like UTF-8 but targeting the chaos
in timezones?
[reply]
[top]
[»]
Re: ASCII
by Ambrose - Aug 30th 2002 00:41:36
If you read comp.fonts (through Google Groups I suppose, to dig out the old
old articles), you may realize that ASCII does not even support English.
Why? Because there is no code point for the two kinds of dashes that are
required by grammar, and no distinction between opening and closing
quotation marks. Worse (contrary to what some people think and try to make
other people think the same), some code points (e.g., apostrophe and grave
accents) have valid alternative meanings (e.g., closing and opening single
quotation marks). Some English words require accent marks. And the
"Icelandic" thorn and eth letters used to be English letters a very long
time ago.
IMHO, the decreasing quality of English punctuation use is directly
attributable to the spread of computers.
The charset conversion problem is not Unix-specific. Users of Chinese or
Japanese Windows / Macintosh see it all the time. English-speaking people
were just not used to seeing this.
[reply]
[top]
[»]
Re: ASCII
by Miles - Jun 19th 2002 17:31:02
> Even though your point about people who
> speak some
> language like arabic, japanese, etc. and
> a roman language
> may be true, but, there are a lot of
> people who speak only
> roman-based languages. So, people who
> only speak roman
> based languages should not have to make
> their files 4 times
> bigger for the others, should they? So,
> UCF-4, etc. already
> exist. So, if you plan to speak
> non-roman languages, use it.
> I dislike non-roman languages.
A couple of things. First of all, the most spoken language is not
English. It's Mandarin (Chinese). There is obviously a place for
non-ASCII character encodings.
Second, text files (as mentioned in the main article) are usually in the
minority for space used on personal computer systems.
Third, there is nothing stopping you from running something like gzip on
the text files which will not only get rid of the multi-byte tax, but
they'll end up smaller than the equivalent ASCII file.
Fourth, I don't think it's really an issue for most folks when 100GB hard
drives are becoming normal; that's a whole hell of a lot of text,
multi-byte or not.
Fifth, it's UCS-4. Heh heh... nitpicking. ISO/IEC 10646 encoding form:
Universal Character Set coded in 4 octets. UTF-8: Unicode (or UCS)
Transformation Format, 8-bit encoding form.
Sixth, this discussion is from a programmer perspective and not really a
user perspective. When you say that you dislike non-roman languages, I
assume that you have no real experience with non-roman languages. That's
fine I guess, but are you willing to state that none of the users of the
programs you write like non-roman languages? This is the real issue.
Finally, if you use UTF-8 on your system, you will see no appreciable
amount of wasted space over ASCII, but the potential to hold most other
characters is still available. And as an added bonus, all of your standard
ASCII documents are still valid. There is very little excuse to only
support western characters when such an easy alternative is available.
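
A rough way to check these size claims is sketched below in Python. The sample text and the `sizes` helper are made up for illustration, and a repetitive sample like this one compresses far better than real prose would, so the exact ratios should not be read as typical.

```python
import gzip


def sizes(text):
    """Return the byte counts of one text under several storage choices."""
    utf8 = text.encode("utf-8")
    ucs4 = text.encode("utf-32-le")  # fixed 4 bytes per code point, no BOM
    return {
        "utf-8": len(utf8),
        "ucs-4": len(ucs4),
        "ucs-4 gzipped": len(gzip.compress(ucs4)),
    }


# Hypothetical sample: pure ASCII, repeated to give gzip something to work on.
ascii_text = "There is nothing stopping you from running gzip on text files. " * 100

if __name__ == "__main__":
    # An ASCII document is already a valid UTF-8 document, byte for byte.
    print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))
    print(sizes(ascii_text))
    print(sizes("日本語"))
```

For pure ASCII input the UTF-8 size equals the ASCII size, the UCS-4 size is four times larger, and compression removes most of that padding again.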
[reply]
[top]
[»]
Re: ASCII
by linuxknight - Jun 20th 2002 22:57:26
> issue for most folks with 100GB hard drives
My computer doesn't support 100GB hard disks,
I only have a 20GB hard drive. Am I supposed
to pay $6000 for a computer with a 100GB hard
drive?
-- Signed, Linuxknight
[reply]
[top]
[»]
Re: ASCII
by Miles - Jun 21st 2002 14:02:26
Where do you live that a new computer costs $6000? A new motherboard
here costs anywhere from $80-$200. A whole new (very fast) computer can be
purchased for less than $1000.
But you're right. A lot of people still have 20GB. That's only about
20 billion* characters of text (ASCII or UTF-8 in a western country) give
or take a couple of billion for program data. Now let's move to UCS-4:
still 5 billion (give or take) characters. Now let's compress that with
something like gzip: getting closer to 150 billion characters.
I fail to see your point.
* U.S. billion -- thousand million in some other locales.
[reply]
[top]
[»]
Re: ASCII
by linuxknight - Jun 21st 2002 18:59:59
> Where do you live that a new computer
> costs $6000? A new motherboard here
> costs anywhere from $80-$200. A whole
> new (very fast) computer can be
> purchased for less than $1000.
> But you're right. A lot of people still
> have 20GB. That's only about 20
> billion* characters of text (ASCII or
> UTF-8 in a western country) give or take
> a couple of billion for program data.
> Now let's move to UCS-4: still 5 billion
> (give or take) characters. Now let's
> compress that with something like gzip:
> getting closer to 150 billion
> characters.
> I fail to see your point.
> * U.S. billion -- thousand million in
> some other locales.
>
I didn't mean 20GB wasn't enough for UTF-8. I've started
using UTF-8 now. UTF-8 is a good idea, because all ASCII
chars will still be ASCII, but other chars would be encoded as
2-4 bytes. Good deal. But, you shouldn't assume everyone
has 100GB hard drives. I only got a 20GB drive 1 year ago.
Before that, I only had a 3GB drive. The only reason I got a
20GB hard drive was because the 3GB one wore out. My old
486DX still uses a 1GB hard drive. As for your question about
where I live: Dallas, Texas, USA
-- Signed, Linuxknight
[reply]
[top]
[»]
Re: ASCII
by MikeFM - Jul 20th 2002 01:51:59
>
> % issue for most folks with 100GB hard
> drives
>
>
> My computer does'nt support 100GB hard
> disks,
> I only have a 20GB hard drive. Am i
> supposed
> to pay $6000 for a computer with a 100GB
> hard
> drive?
>
My computers are mostly ancient P120's and several of them have drives
larger than 100 gigs. Most computers can support drives of these sizes if
you upgrade their BIOS. In cases where they still have problems, the hard
drive companies typically have a utility that runs before your OS and
enables support for the large drives. I've yet to find a
Pentium or newer computer that couldn't support any size drive I slapped
into it.
[reply]
[top]
[»]
Re: ASCII
by Gene Montgomery - Jun 29th 2002 13:11:28
> A couple of things. First of all, the
> most spoken language is not English.
> It's Mandarin (Chinese). There is
> obviously a place for non-ASCII
> character encodings.
Mandarin Chinese may have the largest number of persons who speak it as
their first language; however, it is debatable whether Chinese is
the "most used" language in the world.
See
As a reasonably well-traveled individual, I never cease to be amazed at
the number of countries I have visited where English (many times with an
accent) is available to the locals - and that applies to the Middle East,
South Asia, Japan, Korea, Viet Nam, the Philippines, Holland, Scandinavia,
parts of Africa, and other places. And in none of these locations would
English be classified as the "native" tongue. Yet in some, it is preferred
to the local language(s) or dialect(s) because of its universality.
In the computer software discipline, the employment of English (or a
language with primitive elements derived from English/Romance words, such
as Fortran, Pascal, Ada, C, C++, and so on) is pronounced. I know of no
computer programming language wherein the base language (spoken or written)
is Mandarin Chinese. Indeed, I believe that the Chinese ideogrammatic
language forms would prove difficult to use as the basic "alphabet" or
symbology of a computer programming language.
My experience is that even the Japanese, who type into word processors
and personal computers, use an anglicization/romanization at the keyboard
called "romaji" to enter the sounds of the Japanese language, which are
then converted to Katakana/Hiragana, and/or Kanji, for output to the
display. I have done it while in Japan, but it isn't easy for one whose
Japanese is limited. However, it is second nature to PC-aware Japanese
(and they are becoming intensely PC-aware).
Although I have long since forgotten the details, I remember reading
perhaps 40 years ago about an English person who long ago invented a
romanization system for the Chinese, which sounded similar to the Japanese
method of getting from Romaji to Kanji. IIRC, he did it to help little
children learn the sounds of the language, and enable them to bridge to the
concepts in the ideograms. BTW, the Chinese ideograms, while extremely
similar to the ones used in Japan and possibly similar in meaning, rarely
have even remotely similar pronunciation. So, a Japanese person can probably
get the gist of a written Chinese ideogram, but would have to know (one of)
the Chinese dialects to be able to verbalize the ideogram.
I guess the point of this is that even in the Orient, where the
ideogram reigns supreme with some of the literati and illuminati, they are
forced to revert to the primitive 26-character English alphabet to enter
computer input. So, when you think about i18n, I suggest thinking about
the input issues as well as the output (display) issues, and I suggest that
there are basic communication issues to be addressed - not just between
peoples, but also between the human and the machine. For example, in Japan,
romaji and the English language are universally taught, the latter as a
second language, and have been for a great many decades. It is
necessary, since romaji is taken as a given.
[reply]
[top]
[»]
Re: ASCII - WOOPS my ref. dropped out...
by Gene Montgomery - Jun 29th 2002 13:20:45
> See
http://iteslj.org/Articles/Kitao-WhyTeach.html
[reply]
[top]
[»]
Re: ASCII
by Miles - Jun 29th 2002 13:58:36
> Mandarin Chinese may have the largest
> number of persons who speak it as their
> first language, however, It is
> debatable as to whether Chinese is the
> "most used" language in the world.
On the contrary, I do believe that it is most used. English, however, is
most likely the most widely known and understood. English is by far the
most widely used for commerce, but other than that, more often than not,
people in different countries converse in their native language (they *use*
a non-English language).
But this is missing my point. For some curious reason, some folks have
come to believe that I think that all English-based programming syntax
(if/while/for) should be changed to Chinese pictographs. For the last
time, hear this: I did not say this! I did not imply this. I implied that
there is a sufficiently large number of people to warrant the processing
and display of non-romanized text.
I am aware of the use of romaji for input purposes in Japan. I am also
aware that many of those uses of romaji also include a
kanji/hiragana/katakana translation step. In other words, after the romaji
is input, a list of applicable pictograph substitutes is presented for
selection. This is not every case, but it is a very common case in my
experience. To put it into context, it would be the equivalent of an
English speaker always writing 'to' in their writings whenever 'to', 'two',
or 'too' was intended. Sure the reader could figure it out, and they all
sound the same, but wouldn't it be better in many cases to take the time
and select the correct one? "I have to presents to give to my to
brothers to." See what I mean?
A keyboard does not have to be pictograph-based in order for a computer to
handle pictograph data. There are also advances in handwriting technology.
Right now, it is not uncommon for reporters in Japan to hand-write their
notes and transcribe them later instead of using a laptop and romaji
translation because the laptop is slower for them. Handwriting recognition
would remove this barrier. Of course, it would require that an i18n
capable OS and editor is available -- hence the point of this
discussion.
Note: the point of this discussion is not that we should scrap all
keyboards in current use either.
And yes, I am aware that while the characters of China and Japan are very
similar, their speech, pronunciation, and cultures are very distinct. In
fact, other posts of mine in this discussion have pointed out this fact;
however as not all computers have text-to-speech engines and are visually
accessed from a screen in most cases, the importance of being able to
display and input the characters is still much more relevant.
Yes, East Asia uses the romanized alphabet extensively. Any visit to East
Asia, however, will demonstrate very quickly that it is not used to the
exclusion of pictographs. There is demand out there to handle both.
[reply]
[top]
[»]
Two steps approach
by Mirza Hadzic - Jun 18th 2002 09:40:53
1. Codepages *SUCKS*
2. Moving all Linux code to UTF-8 as a first step of getting rid of
codepages is good. I expect all important linux code will be UTF-8
compatible in two years at most.
3. After we have everything translated to UTF-8, we can move to UCS-4 in
no time. Moving there from the point where we are now would be much more
complicated because of codepages. This step (UCS-4) will make string
operations both easier to write and faster, while only UTF-8 -> UCS-4
conversion will be needed in the system, assuming all text will be UTF-8
already.
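
The UTF-8 -> UCS-4 step can be illustrated with a short Python sketch. The function names are invented for this example, and a list of integers stands in for an array of 32-bit characters.

```python
def utf8_to_ucs4(data):
    """Decode UTF-8 bytes into a list of code points (one integer per character)."""
    return [ord(ch) for ch in data.decode("utf-8")]


def ucs4_to_utf8(code_points):
    """Encode a list of code points back into UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


if __name__ == "__main__":
    stored = "aü€".encode("utf-8")      # what sits on disk
    in_memory = utf8_to_ucs4(stored)    # fixed-width form for string operations
    print([hex(cp) for cp in in_memory])
    print(ucs4_to_utf8(in_memory) == stored)
```

Once all text on the system is UTF-8, this single pair of conversions is the only one a program needs, which is the simplification described above.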
[reply]
[top]
[»]
Re: Two steps approach
by Daniel - Jun 18th 2002 17:29:57
> 3. After we have everything translated
> to UTF-8, we can move to UCS-4 in no
> time. Moving there from the point where we
> are now would be much more complicated
> because of codepages. This step (UCS-4)
> will make string operations both easier
> to write and faster, while only UTF-8
> -> UCS-4 conversion will be needed in
> the system, assuming all text will be
> UTF-8 already.
Yes, this is the way to go. And I suspect a lot of apps won't even need to
convert to UCS-4 later. It should definitely be a per-app decision to do so
-- if you just want to display strings, as most GUI apps do, there is no
need to deal with character-conversion at all.
[reply]
[top]
[»]
Re: Two steps approach
by linuxknight - Jun 19th 2002 16:36:08
> 1. Codepages *SUCKS*
That's not valid english. It's not "Codepages sucks",
It would be "Codepages suck", and they don't.
-- Signed, Linuxknight
[reply]
[top]
[»]
Re: Two steps approach
by Daniel - Jun 19th 2002 17:32:56
OK, let's play the game...
> > 1. Codepages *SUCKS*
>
> That's not valid english. It's not
> "Codepages sucks",
That's not valid English either. a) You should use a capital letter, and
b) a sentence ends with a period, not a comma.
> It would be "Codepages suck",
> and they don't.
They do suck, as well as you. Now go home to mom and let us do something
more productive than replying to this ridiculous shit you're typing.
(Yeah, I know, I shouldn't feed the trolls. I just couldn't bear it, sorry
everyone.)
[reply]
[top]
[»]
Your title, "True Internationalization" holds the key to the requirements.
by Curtis Veit - Jun 15th 2002 18:13:09
Good and very relevant topic. (Stepping up on soapbox)
If it is true internationalization you are after then the only good
existing answer is UTF-8 which allows multiple languages in the same
document without context sensitivity for characters based on position in
the document. If you really require 4 byte representation within your
program you can always convert to UCS-4 while you do your private magic.
Please read up on all the existing choices. If you do, I suspect that you
will see the many advantages of UTF-8.
This also provides the advantage of being usable for multiple languages on
machines that were designed for 8 bit ASCII characters without even
requiring Unicode conversion routines (as long as you only use ASCII and
UTF-8). This is absolutely brilliant for embedded devices where space is
still an issue.
Unfortunately 32 bit chars do require conversion libs and will be context
sensitive because you cannot fit all the possible characters for all
languages into a single 32 bit code space. - Or perhaps you are
proposing that some people's languages are not important enough to
include?
(This requires special codes to switch to a new code space. The resulting
context problems are a much more severe programming problem than different
byte lengths for various characters.) Also you can program your Open (or
closed) source programs in utf-8 today with more than one language used in
the source and it works fine, (even on my ancient systems).
You are exactly right about needing full international support in
computers today. From what I have seen the people doing real work in this
area go to UTF-8. You can probably tell that it is my choice as well.
(A question for Unicode wizards: why is there a common practice to be
converting utf-8 to ucs-4 for storage? To me utf-8 seems to be ideal for
both storage and use within programs.)
Regards,
Curtis
[reply]
[top]
[»]
Re: Your title, "True Internationalization" holds the key to the requirements.
by David Roundy - Jun 16th 2002 07:52:42
>
> Unfortunately 32 bit chars do require
> conversion libs and will be context
> sensitive because you cannot fit all the
> possible characters for all languages
> into a single 32 bit code space. -
> Or perhaps you are proposing that some
> people's languages are not important
> enough to include?
> (This requires special codes to switch
> to a new code space. The resulting
> context problems are a much more severe
> programming problem than different byte
> lengths for various characters.)
>
How many languages do you think there are? 32 bits would support over 4
million languages with 1000 characters each! Admittedly, there are some
languages with more than 1000 characters, but it's also true that many
languages share a character set.
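
A quick check of the arithmetic: a 32-bit code space has 2^32 values, which at 1000 characters per language comes to roughly 4.3 million languages. The variable names below are only for illustration.

```python
# Size of a 32-bit code space and how many 1000-character languages fit in it.
CODE_SPACE = 2 ** 32
CHARS_PER_LANGUAGE = 1000

languages = CODE_SPACE // CHARS_PER_LANGUAGE

if __name__ == "__main__":
    print(CODE_SPACE)   # 4294967296
    print(languages)    # 4294967
```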
[reply]
[top]
[»]
Re: Your title, "True Internationalization" holds the key to the requirements.
by piman - Jun 18th 2002 16:04:07
> many languages share a character set.
Yeah, and as soon as we try to combine them we get some of the existing
CJK Unicode problems. The only Right Way to do it is delineate by language,
not by typography.
[reply]
[top]
[»]
Re: Your title, "True Internationalization" holds the key to the requirements.
by David Starner - Jun 25th 2002 01:05:17
>
> Yeah, and as soon as we try to combine
> them we get some of the existing CJK
> Unicode problems. The only Right Way to
> do it is delineate by language, not by
> typography.
Let me guess; your native language is either C, J or K? There are two main
problems with your solution. First, there are about 5,000 languages by some
counts, and boundaries are very ill-defined in some cases; worse yet,
computers have to handle historical texts, which add a whole new dimension
to the problem.
Secondly, while Chinese and Japanese may believe their character sets are
entirely disjoint, Europeans usually perceive the Latin character set as
one connected whole. "valet" comes out of my keyboard just fine,
and few would argue that it needs to be stored or displayed differently if
it's French or English. Likewise, saying that "Ulrich Drepper, Robert
M&#xfc;ller and Ri&#x10d;ard &#x10c;epas worked on a
project" (freshmeat won't let me include the actual characters) comes
naturally. Note that the only mono-lingual European character sets are
ISO646-*, which had only a few characters to work with. Once eight bit sets
became common, all major European character sets covered multiple
languages.
[reply]
[top]
[»]
Why is this taking the form of a religious debate?
by Miles - Jun 15th 2002 17:48:25
UTF-8 should be used as a text storage format. Why? Because the vast
majority of documents in the world, when saved, will then take up less space
on the hard drive. Let's face it, as has been mentioned elsewhere, most of the
text files out there are compatible with ASCII. This is not racism or
imperialism. It's pragmatism.
String manipulation within programs (in-memory) is a whole different
story, however. Here, a fixed-size character makes more sense in most
cases. While the case for the program that simply reads data in and spits
it out again verbatim is a fairly common one, the vast majority of
real-world programs manipulate character strings during the course of
processing.
To be more clear about this, getting the Nth character of a
fixed-char-size string is a constant-time operation, O(1), and takes the
same amount of time whether N is equal to 5 or 500. On the other hand,
getting that same character in a variable-width-char string is a linear
operation, O(n), and takes approximately one hundred times as long to get
the 500th character versus the 5th character. The same holds true for substring
operations. Character replacement gets a lot more tricky as well. If all
characters are treated the same, what happens when someone tries to replace
the Nth character (which for the sake of this exercise, we'll call 'c')
with a Thai character? In a fixed-width character string, this works just
as in an ASCII string. In a variable-width character string, a lot of extra
processing and data movement is necessary or subsequent characters will get
overwritten due to the size difference between the two characters.
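
The difference can be sketched in Python, working on raw bytes so the costs are visible. The helper names are hypothetical, and the sketch assumes well-formed UTF-8 input.

```python
def utf8_offset(data, n):
    """Byte offset of the n-th character (0-based) in UTF-8 data: a linear scan."""
    count = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a character starts here
            if count == n:
                return offset
            count += 1
    raise IndexError(n)


def ucs4_offset(n):
    """Byte offset of the n-th character in a 4-byte fixed-width string: O(1)."""
    return 4 * n


def replace_char_utf8(data, n, new_char):
    """Replace the n-th character; the tail must move if the byte widths differ."""
    start = utf8_offset(data, n)
    end = start + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1  # skip the continuation bytes of the old character
    return data[:start] + new_char.encode("utf-8") + data[end:]


if __name__ == "__main__":
    print(utf8_offset("aüb".encode("utf-8"), 2))          # 3, because ü takes 2 bytes
    print(ucs4_offset(2))                                 # 8
    print(replace_char_utf8(b"abc", 2, "ก").decode("utf-8"))
```

Replacing 'c' with a Thai character grows the UTF-8 string from 3 to 5 bytes, which is exactly the data movement described above; in the fixed-width form the same replacement overwrites one 4-byte slot.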
To ignore the existing base of data is a recipe for the existing base of
programmers and writers to ignore you.
The first step is UTF-8 compatibility. This is a minor change to most
programs. Without some tie to a universal character encoding, i18n is
impractical for all intents and purposes.
The second step is to impress upon programmers the algorithmic efficiency
advantages of using fixed-width characters in their programs instead of
variable-width.
Finally, as time goes on and new programs are created -- especially with
the advent of newer languages that encourage better i18n behaviour --
people may find it easiest to save their data in UCS-2 or UCS-4 because
that is the serialized form of their in-memory data structure. Then and
only then will you see the widespread changeover. Anything else is pissing
into the wind.
Will it happen overnight? No. Will it happen in our lifetimes?
Probably. Is it worth it to rip apart all of our existing infrastructure,
effectively stop all new development, effectively halt all existing
development, and recode everything right now with 4-byte characters? I
certainly hope that you say 'no'. Otherwise you will be advocating the
betterment of society by making it a horrible place to live.
[reply]
[top]
[»]
Re: Why is this taking the form of a religious debate?
by Srin Tuar - Jun 15th 2002 23:21:02
> To be more clear about this, getting the
> Nth character of a fixed-char-size
> string is a constant-time operation, O(1),
> and takes the same amount of time
> whether N is equal to 5 or 500. On the
> other hand, getting that same character
> in a variable-width-char string is a
> linear operation, O(n), and takes
> approximately one hundred times as long
> to get the 500th character versus the
> 5th character.
> The second step is to impress upon
> programmers the algorithmic efficiency
> advantages of using fixed-width
> characters in their programs instead of
> variable-width.
This is not necessarily true: with combining characters, bidirectional
text, and other Unicode features you will still need to do the same amount
of work with wide characters as you do with multibyte characters.
An additional benefit of UTF-8 internal use is byte-order independence,
which bypasses a perennial problem faced when making code portable.
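
Both points can be seen in a small Python sketch using the standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT

# Even with fixed-width code points, one visible character may span two slots.
width_composed = len(composed)
width_decomposed = len(decomposed)
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Wide encodings depend on byte order; UTF-8 has a single byte sequence.
little = composed.encode("utf-32-le")
big = composed.encode("utf-32-be")
utf8 = composed.encode("utf-8")

if __name__ == "__main__":
    print(width_composed, width_decomposed, same_after_nfc)
    print(little.hex(), big.hex(), utf8.hex())
```

So "the Nth code point" and "the Nth character the user sees" can differ even in UCS-4, and a UCS-4 file needs an agreed byte order while a UTF-8 file does not.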
[reply]
[top]
[»]
Re: Why is this taking the form of a religious debate?
by Miles - Jun 16th 2002 15:34:50
Combining characters are not as much an issue if you enlarge your
character size. Assuming a 32-bit character with one bit excluded for
internal use (effectively a 31-bit character) you have 2.15 billion
characters from which to choose. You will forgive me if I don't see an
immediate and urgent limitation.
"Planning for the future" is not really viable in that the future tends to
draw out possibilities previously unseen. You have to work with what you've
got. What happens if we have visitors from outer space and must incorporate
their characters (and the other billion species')? No, not very likely. Not
likely at all. But it would totally invalidate any "perfect" solution that
we might come up with today. I imagine there are other possibilities not as
remote as the "alien contact" mentioned above that would still put a
monkey-wrench in a "perfect" character-encoding solution.
As for bi-directional text (misleading when you consider that text in the
world has more than two possible directions -- think up/down), the display
order of the text is not necessarily the logical order. Just because the
display of the text is right-to-left (for example) doesn't mean that
the characters must be kept in memory in contrary order to left-to-right
characters. It just means that the first character in a block is rendered
on the right instead of the left. It does not have to be a BiDi text
issue.
You're right that byte order could be an issue on some platforms when
serializing strings. Explicit serialization to UCS-2 or UCS-4 is needed.
Point taken.
[reply]
[top]
[»]
I almost forgot...
by Miles - Jun 16th 2002 15:52:23
Just in moving people away from ASCII and the assumption that characters
should be treated as a single byte, you will have solved 90% of the
problem. Honestly, who cares what universal encodings are out there as
long as you recognize that other encodings exist. Once a programmer
recognizes that encodings (besides ASCII) exist and are worth supporting,
the simple/dumb routines for character input will start to fade into the
background.
Once the data's in memory, who cares what format it's in? The only ones
who must interoperate with in-memory strings are the developers and the
maintainers and they should be using whatever encoding best fits the task
at hand. In some cases, it'll be UTF-8. In others, it may be UCS-4. In
others, Shift-JIS may fit the bill. As long as there is a conversion
routine to and from Unicode for whatever you are using in your program, you
can get from any encoding in the world to any other.
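
The idea of Unicode as a pivot can be sketched in Python as follows. The encoding list and function names are chosen for illustration; Python's built-in codecs play the role of the "to and from Unicode" routines.

```python
# Hypothetical set of encodings a program might meet.
ENCODINGS = ["utf-8", "utf-32-le", "shift_jis", "euc_jp"]


def transcode(data, source, target):
    """Convert between two encodings by decoding to Unicode and re-encoding."""
    return data.decode(source).encode(target)


def round_trips(text):
    """Check that every source/target pair survives a trip through the pivot."""
    for source in ENCODINGS:
        for target in ENCODINGS:
            encoded = text.encode(source)
            there = transcode(encoded, source, target)
            if transcode(there, target, source) != encoded:
                return False
    return True


if __name__ == "__main__":
    sjis = "日本語".encode("shift_jis")
    print(transcode(sjis, "shift_jis", "utf-8").decode("utf-8"))
    print(round_trips("日本語 text"))
```

With a pivot, each encoding needs only one decoder and one encoder, rather than a dedicated converter for every pair.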
But of course, the hard part so far is getting C coders to accept a
non-ASCII world. This is not spite toward C for spite's sake. C (and its
derivatives) are some of the last popular holdouts that (a) have little
i18n and l10n support in the standard language and (b) make little
distinction between the concept of a character/string and a byte-array that
represents a character/string. If it were more of an abstraction --
putting in a different back-end and letting the compiler deal with these
details -- the world would be a better place (most likely with fewer buffer
overflow vulnerabilities as well).
[reply]
[top]
[»]
Re: Why is this taking the form of a religious debate?
by Shiro.k - Jun 16th 2002 19:32:49
I do agree that fixed-length character string representation is more
efficient than variable-length one.
But I think it's not so bad as some claim.
> To be more clear about this, getting the
> Nth character of a fixed-char-size
> string is a constant-time operation, O(1),
> and takes the same amount of time
> whether N is equal to 5 or 500. On the
> other hand, getting that same character
> in a variable-width-char string is a
> linear operation, O(n), and takes
> approximately one hundred times as long
> to get the 500th character versus the
> 5th character. The same holds true for
> substring operations.
I wonder if this is really an issue. When I process text, what I do
mostly is either scan the text sequentially, or use some search operation
(like substring match or regexp match). Indices are returned as the result
of the search operation so that I can extract the matched region from the
string, but they need not be character indices---any kind of pointer does
the job, and it's possible to create a pointer object that accesses any
part of the string in constant time.
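
As a sketch of this in Python: a substring search over UTF-8 bytes returns byte offsets, and those offsets slice the matched region directly without counting characters. The function name `find_span` is made up for this example, and both inputs are assumed to be valid UTF-8.

```python
def find_span(haystack, needle):
    """Return (start, end) byte offsets of the first match, or None."""
    start = haystack.find(needle)
    if start < 0:
        return None
    return start, start + len(needle)


if __name__ == "__main__":
    text = "naïve café über".encode("utf-8")
    span = find_span(text, "café".encode("utf-8"))
    print(span)                                      # byte offsets, not character indices
    print(text[span[0]:span[1]].decode("utf-8"))     # café
```

Because no UTF-8 character's encoding can begin in the middle of another character's encoding, a match on whole encoded characters always falls on character boundaries, so the byte offsets are safe to slice with.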
I hardly see the case that I have to apply a pre-determined character
index that is not a result of search operation. Maybe my experience is
limited.
The searching operation for variable-length characters is slower than the
one for fixed-length. That's an issue. But it's not as bad as O(N) versus
O(1).
>Character replacement gets a lot more tricky as well.
Yes, it is tricky. But again, how common is the operation to substitute
characters in place? When I use the programming languages that don't have
automatic memory management, I tend to do the in-place replacement as much
as possible.
But in many situations the size of replacement string differs from the
size of original region and I end up reallocating whole string.
If the language supports garbage collection, I even think that prohibiting
in-place replacement is more beneficial,
because it allows me to share substrings without worrying that the shared
storage will be modified inadvertently.
(I assume I use some kind of "string object" that has the
pointer to the storage of actual string).
I don't have any concrete performance comparison between fixed- vs
variable-length string representation.
Excuse me if my idea is irrelevant.
[reply]
[top]
[»]
Re: Why is this taking the form of a religious debate?
by Miles - Jun 19th 2002 17:10:06
> I wonder if this is really an issue.
> When I process the text, what I do
> mostly is either scan the text
> sequentially, or use some search
> operation (like substring match or
> regexp match). Indices are returned as
> the result of the search operation so
> that I can extract the matched region
> from the string, but they need not be
> character indices---any
> kind of pointer does the job, and it's
> possible to create a pointer object
> that accesses any part of the string in
> constant time.
True enough. In this case, you are right (for C). I've been spending a
lot of time in higher-level languages and forgot some of my C idioms.
> > Character replacement gets a lot more tricky as well.
>
> *** edited for brevity ***
>
> But in many situations the size of replacement string
> differs from the size of original region and I end up
> reallocating whole string.
And here is the crux of the matter. You are experienced. You probably
remember to do this every time. However, you do not write the vast majority
of software out there. I would venture a guess that the vast majority of
software is written by someone who is not as proficient as you are. There
is nothing stopping a good coder from doing as you say. However, a lot of
coders aren't that good and haven't yet learned good practices
or, more to the point, good practices with regard to non-ASCII character
encodings. Far more likely, even if someone knows better, after a
many-hour coding session, programmers make stupid mistakes. When all of your
unit testing is with standard ASCII strings (for example), the compiler
won't catch the error. There are far more programmers out there that have
written code in C for fewer than two years than programmers who have
written C for more than five years. While this is true for every language,
many other languages have built-in abstractions for character strings that
simply don't exist for standard C.
> I don't have any concrete performance comparison
> between fixed- vs variable-length string representation.
> Excuse me if my idea is irrelevant.
Not at all irrelevant. For most of the situations you describe, you are
correct in that the speed difference would be negligible (in C). I was
more concerned with maintainability -- especially in situations where you
are not the maintainer. But for the most part, you were more correct than
I with regard to the common cases.
Re: Why is this taking the form of a religious debate?
by Bill Spitzak - Aug 6th 2002 02:51:24
I recommend UTF-8 ONLY for precisely the reasons you say require wide
characters: maintainability and testing.
If there are two interfaces for "ASCII" and "Wide characters" then the
typical programmer is only going to test the "ASCII" interface and there
are going to be bugs when an i18n user tries it. However if there is only
ONE interface, "UTF-8", then that interface is going to be tested!
Also no "amateur programmer" is going to successfully replace any
characters in any string. The function "replace chars n-m with these m-n
other characters" just is not used by anybody. Check Visual Basic if you
don't believe me: that function does not exist (the replacement is allowed
to be a different size). Any amateur programmers coming from that
background are not going to want to do anything that you cannot do in
UTF-8.
Re: Why is this taking the form of a religious debate?
by Bill Spitzak - Aug 6th 2002 02:47:02
Absolutely agree. I don't understand why people think that indexing the
Nth character needs to be fast.
In any text processing I can think of, N must be calculated first by
scanning all the characters before it. It is trivial to replace N by the
byte count and continue with the algorithm as before. So I see no savings
there.
Also all modern text processing thinks about "words" and these are
variable-length. It makes no difference if the characters inside them are
variable length as long as it is easy to detect the word boundaries.
I recommend, with NO EXCEPTIONS, that UTF-8 be used for every single
interface in a system where text is passed. There should be no "ASCII"
interface, and certainly there should be no "wide character" interface. I
don't think any programs will have to store or manipulate text in any form
other than UTF-8.
A huge win with UTF-8 only is that it eliminates the need for multiple
interfaces. strlen() and so on are unchanged except they are defined to
return the number of bytes in the string.
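A small sketch of the distinction, assuming valid UTF-8 input: strlen() keeps reporting bytes, which is what allocation and I/O need, while a character count requires a scan. The function name utf8_count is invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the encoded characters (code points) in a
 * valid, NUL-terminated UTF-8 string by counting every byte that is not
 * a continuation byte (10xxxxxx). */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the string "naïve" encoded as UTF-8, strlen() returns 6 and utf8_count() returns 5; code that only copies or concatenates the string needs just the former.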
Re: Why is this taking the form of a religious debate?
by linuxknight - Jun 19th 2002 16:42:00
> most of the text files out
> there are compatible with ASCII. This is
> not racism or imperialism. It's
> pragmatism.
I agree with that.
-- Signed, Linuxknight
Two different points of view
by Alessandro Staltari - Jun 15th 2002 16:34:25
I think the character set issue is mostly related to documents and graphical
user interfaces rather than coding and consoles.
For the latter, ASCII may be enough (does it make sense to translate shell
commands and language keywords? MS Excel says yes, but I'm not sure it is
so useful).
Low level access to the system can still be ASCII, while for higher level
interfaces we can use i18n and libraries to handle it.
Sorry, UTF-8 *is* the way to go
by Srin Tuar - Jun 15th 2002 15:56:06
Rather than re-writing virtually all existing source code, it makes
infinitely more sense to go with UTF-8. In fact this decision has already
been made, and disparate operating systems from Windows to Linux (in a big
way Linux) are slowly standardizing on it.
UTF-8 gives you compatibility with ASCII, full access to the full 31-bit
Unicode (Unicode saves the one extra bit for error codes, sign bits etc.,
very smart!), an error-recoverable byte stream, stateless, computationally
trivial conversion, very low overhead for most existing text, overwhelming
compatibility with existing software (no code changes for most software!!),
and relatively trivial string width computation;
see for yourself:
"http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c"
Using UCS-4 would be a huge headache with few benefits. It would also
introduce all new kinds of bugs, like for example assuming that the number of
UCS-4 chars would equal the display width of the string (not true, see
combining chars, zero-width glyphs, etc.)
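The "error-recoverable byte stream" property can be sketched in a few lines of C. This is only an illustration under the assumption that the buffer holds UTF-8; utf8_sync_back is a made-up name. Since continuation bytes always look like 10xxxxxx and lead bytes never do, a reader dropped at an arbitrary byte offset can find the nearest character start without decoding anything that came before.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a pointer p somewhere inside the buffer that
 * begins at start, move backwards until p no longer points at a UTF-8
 * continuation byte. The result is the start of the character containing
 * the original position. */
const char *utf8_sync_back(const char *start, const char *p)
{
    assert(start != NULL && p != NULL && p >= start);
    while (p > start && ((unsigned char)*p & 0xC0) == 0x80)
        p--;
    return p;
}
```

A fixed-width 32-bit stream has no such marker: lose one byte and every following character is misaligned.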
16 bits is enough
by Ben Crowell - Jun 15th 2002 12:35:06
It's really a myth that 16 bits isn't enough. People make statements about
how there's some huge number of Chinese characters, too many for 16 bits.
But the reality is that most of these characters are names that parents
have had experts create for their children in order to be able to name them
something special. Nobody can read these characters. (Heck, as an American,
I could name my kid with some weird symbol, and then complain that it
wasn't implemented on computers.) The typical Chinese person knows a
relatively small number of characters, and even highly educated people know
far fewer than 2^16.
Re: 16 bits is enough
by Shiro.k - Jun 15th 2002 18:01:37
You're almost right that you don't need more than 2^16 chars for daily life,
but we Japanese have historical documents and classical literature.
We can't change the names of people and places, present or past,
just to fit them into a 2^16 charset.
Unless you suggest that we abandon our history, 2^16 is not enough.
(If you mean surrogate pairs in UTF-16, yes, that's enough so far to
represent those rare characters in a stream of 2^16 codes.)
Re: 16 bits is enough
by cpchan - Jun 15th 2002 18:10:17
> the reality is that most of these
> characters are names that parents have
> had experts create for their children in
> order to be able to name them something
> special.
What have you been smoking? I want some! I both speak and write Chinese
fluently and this is the first time I have heard of such nonsense.
Charles
Re: 16 bits is enough
by David Starner - Jun 25th 2002 01:10:35
> It's really a myth that 16 bits isn't
> enough. People make statements about how
> there's some huge number of Chinese
> characters, too many for 16 bits. [...]
It's really a moot point. There's no competing 16 bit standard. To support
Cantonese, Hong Kong Chinese and modern musical notations (all important
things), one must support full 32-bit Unicode. (For Cantonese, I'm told one
of the characters is the equivalent of the English -ing suffix, necessary
for almost any written Cantonese.)
What about typesetting?
by Gregor Mueckl - Jun 15th 2002 10:59:11
This article only covers the characters and fonts involved in outputting
text written in different languages. That's a problem in multilanguage
environments like the internet, but it's not the only one.
The output has to be displayed correctly as well. Western languages are
written and read from left to right. Most text printing routines can only
handle that way of outputting text. But it's not the only one. Arabic text
goes from right to left (just the opposite of western languages). And if I
remember correctly, some eastern languages are even written from top to
bottom, that is, in columns rather than lines.
So you have to think about a way to handle output as well. A graphical
system can print single characters correctly, but they need to be aligned
correctly to form a text that makes esnes - sorry, I meant "sense".
I think that this becomes a serious problem if you build a system that is
capable of printing all those characters at the same time. How do you
typeset a text that contains passages in English, Arabic and Chinese?
Re: What about typesetting?
by Christian Rose - Jun 15th 2002 13:04:39
> The output has to be displayed correctly
> as well. Western languages are written
> and read from left to right. Most text
> printing routines can only handle that
> way of outputting text. But it's not the
> only one. Arabic text goes from right to
> left (just the opposite of western
> languages). And if I remember correctly,
> some eastern languages are even written
> from top to bottom, that is, in columns
> rather than lines.
>
> So you have to think about a way to
> handle output as well. A graphical
> system can print single characters
> correctly, but they need to be aligned
> correctly to form a text that makes
> esnes - sorry, I meant
> "sense".
>
> I think that this becomes a serious
> problem if you build a system that is
> capable of printing all those characters
> at the same time. How do you typeset a
> text that contains passages in English,
> Arabic and Chinese?
Umm, I think these problems have in large part already been solved by
systems like Pango etc. Are you familiar with Pango?
Re: What about typesetting?
by Emil Perhinschi - Jul 3rd 2002 19:52:06
Why not let all this internationalization be the burden of typesetting
and word-processing programs? The ``source'' can be stored in ASCII without
much trouble ... I did this in LaTeX with English, Romanian and Polytonic
Greek in one document ...
Then think of the costs of internationalization, the trouble of
standardization and coping with programmers' hubris ...
Emil
P.S.: I have no claim to correct English in my reply
32-bit text format
by Deekoo L. - Jun 15th 2002 07:18:24
(Of course, this appears right as I'm uploading
a UTF-8-native version of Yeemp...)
Problems - first off, everything has to be
audited and recompiled - malloc(strlen(src)+1)
needs to become malloc(strlen(src)+sizeof(char))
everywhere it appears.
Second: Confining things to 7bit seems wasteful.
The only reason to use 7bit is to transfer data
cleanly over 7bit-only links. 7-bit protocols
will require that the commands be in 8-bit chars,
even if the data are in 32-bit chars. To deal
cleanly with many 7-bit protocols, you'll need to
avoid using a large number of control and ASCII
glyphs in the 32-bit chars. Worse, the glyphs
to avoid differ from protocol to protocol.
Embedded nulls and CRs or LFs will break almost
any 7-bit protocol; @ signs in the wrong place
will choke SMTP; . will confuse domain resolvers;
space will confuse webservers. The characters
remaining for your encoding (and that's just after
chopping the ones that I think'll cause problems)
will probably make Base64 look pleasant. Further,
parsing a glyph index composed of discontiguous
septets in a 32-bit word will be a nuisance to
any program which has to deal with them. If
you're changing the char size, it breaks enough
stuff as-is that there's no point in trying to
get it 7-bit-clean *too*.
However, I do want one single giant character set, whether
it's 16-bit, 32-bit, or something else. Having
to tag every bit of content with encodings is
annoying (when it's text files), infuriating
(when it comes to files with multiple chunks of
data in different encodings), unfeasible (how
am I supposed to indicate the encoding for
user@host.co.uk, where 'host' is in Devanagari
and 'user' in Hangul?), and unreliable (when every
web browser comes with a list of every encoding
that any other web browser ever claimed to support...).
IMO, display, character sets, and input are
things that should be semi-independent - I don't
want to put the BFF on my emergency boot
disk, and input methods that are great for one
language may be suboptimal or terrible for
another.
sizeof(char)
by mikpos - Jun 15th 2002 10:45:49
> Problems - first off, everything has to
> be
> audited and recompiled -
> malloc(strlen(src)+1)
> needs to become
> malloc(strlen(src)+sizeof(char))
> everywhere it appears.
Do you mean malloc(strlen(src)+1) should become
malloc((strlen(src)+1)*sizeof(char)) everywhere? In any case it doesn't
matter since the C standard requires that sizeof(char) be 1. Any compiler
which doesn't make sizeof(char) 1 is non-conforming. This makes sense
since the sizeof operator returns the size of its operand in chars. How
many chars are there in a char? Well 1 of course.
Of course this does not preclude 32-bit chars. There are lots of
implementations out there where CHAR_BIT is 32.
Using 32-bit wchar_ts would make more sense than making 32-bit chars
though.
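A short sketch of the two allocation idioms, for comparison. The helper names dup_bytes and dup_wide are invented here. With char the element size is 1 by definition, so strlen(src)+1 is already the byte count; with wchar_t the element count has to be multiplied by the element size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: duplicate a char string. sizeof(char) is 1 by
 * definition, so the element count is also the byte count. */
char *dup_bytes(const char *src)
{
    size_t n;
    char *copy;

    assert(src != NULL);
    n = strlen(src) + 1;                 /* +1 for the terminator */
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, src, n);
    return copy;
}

/* Hypothetical helper: duplicate a wide string. Here the element count
 * must be scaled by the element size. */
wchar_t *dup_wide(const wchar_t *src)
{
    size_t n;
    wchar_t *copy;

    assert(src != NULL);
    n = wcslen(src) + 1;                 /* +1 for the terminator */
    copy = malloc(n * sizeof *copy);
    if (copy != NULL)
        memcpy(copy, src, n * sizeof *copy);
    return copy;
}
```

Writing sizeof *copy instead of a type name keeps the allocation correct if the element type is changed later.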
> However, I do want one single giant
> character set, whether
> it's 16-bit, 32-bit, or something else.
I agree. We already have such a thing, though: UCS-2 and UCS-4. Strings
of Unicode characters that are uniformly 16 bits or 32 bits (respectively)
in size. I didn't see the author proposing anything that UCS-4 wouldn't
fix.
I'd rather use UTF-8 than UCS-4, though.
It's not that bad, actually
by Daniel - Jun 15th 2002 06:40:54
What's wrong with UTF-8? It's an 8 bit multi-byte encoding, and therefore
independent of endianness. It's absolutely compatible with plain 7 bit ASCII
and easy to program for. Most of the code written for ASCII continues to
work with UTF-8, e.g. substring search.
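A sketch of why substring search carries over unchanged, assuming both strings are valid UTF-8: lead bytes and continuation bytes occupy disjoint ranges, so a byte-level match of a complete needle can only begin on a character boundary. The wrapper name utf8_find is made up; the work is done by the standard strstr().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: return the byte offset of the first occurrence
 * of needle in haystack, or -1 if there is none. Plain strstr() is
 * enough; it never needs to know the strings are UTF-8. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *hit;

    assert(haystack != NULL && needle != NULL);
    hit = strstr(haystack, needle);
    return hit != NULL ? (long)(hit - haystack) : -1L;
}
```

Searching the three-character Japanese string E6 97 A5 E6 9C AC E8 AA 9E for its middle character E6 9C AC yields byte offset 3, the start of the second character.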
Also, it's just not true that Unicode text files start with a special
character sequence. That might be a bad Windows habit, but it's not
required by any standard. Once everyone has moved to UTF-8, we can forget about
all those character set problems.
> Unlike UTF-8, we don't want different sizes for Western and Eastern
> characters; that makes programmers unhappy and software difficult to
> control. Also, UTF-8 emphasizes historical Western domination of computing
> science, which is not very friendly. No start and end sequences -- that's
> it.
That's just nonsense. UTF-8 is not hard to program for. And if you don't
want to do it yourself, there are a whole lot of libraries out there that
deal with it. Trust me, it's going to become the standard on
GNU/Linux systems in the near future.
Regarding "Western domination": Please try to view this more
pragmatically. The worst that can happen with UTF-8 over UCS-4 (the 32bit
Unicode encoding), is that a full-length character needs 6 bytes rather
than 4. But in practice, most characters won't need full 6 bytes -- for
instance, Japanese fits just fine into UCS-2 (16 bit), right? Why should it
need more than 4 bytes per character in UTF-8 encoding?
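The size argument can be checked with a small sketch. It uses the original 31-bit UTF-8 definition discussed in this thread, where the longest sequence is 6 bytes; later revisions of the standard restrict Unicode to U+10FFFF, which caps UTF-8 at 4 bytes. The function name utf8_length is invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode code point cp
 * in UTF-8, using the original 31-bit scheme (1 to 6 bytes). */
int utf8_length(uint32_t cp)
{
    assert(cp <= 0x7FFFFFFF);           /* 31-bit code space          */
    if (cp < 0x80)      return 1;       /* ASCII                      */
    if (cp < 0x800)     return 2;       /* Latin supplements, Greek.. */
    if (cp < 0x10000)   return 3;       /* rest of the 16-bit plane   */
    if (cp < 0x200000)  return 4;
    if (cp < 0x4000000) return 5;
    return 6;
}
```

Everything that fits into 16 bits, which includes the commonly used Japanese characters, needs at most 3 bytes.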
And you can't just declare ASCII obsolete. It's just fine for English, and
face it: Any serious programming has to be done in English nowadays (at
least Open Source programming), and I doubt this will change in the
foreseeable future.
Maybe you should have a look at GTK+ and Pango. GTK+ 2.0 uses UTF-8 for
all text now.
Re: It's not that bad, actually
by Murray Cumming - Jun 15th 2002 07:00:57
> Unlike UTF-8, we don't want different
> sizes for Western and Eastern
> characters; that makes programmers
> unhappy and software difficult to
> control. Also, UTF-8 emphasizes
> historical Western domination of
> computing science, which is not very
> friendly.
I don't think it's unfair to give greater importance to alphabetic, or
even syllabic, writing systems. We want to support those other complex
character-per-word character sets, but they are purely "legacy"
languages. That's not cultural bias - that's system analysis. This opinion
is supported by the popularity of romanized Japanese as an input system
even among native speakers of Japanese.
Re: It's not that bad, actually
by Bill Spitzak - Aug 6th 2002 03:00:06
More importantly, even real Japanese or Chinese has so many spaces, digits,
control characters, punctuation, and embedded Latin text that it will be
shorter in UTF-8 than in UTF-16 or any of the other proposed encodings.
There is no bias whatsoever in UTF-8; it really is a crude form of Huffman
encoding. I also see no reason why a Chinese word that translates to a
several-character word in English must be stored in one character of space.
If anything, you are presenting a reverse bias.
Re: It's not that bad, actually
by Nudge - Jun 15th 2002 07:08:10
> Also, it's just not true that Unicode
> text files start with a special
> character sequence. That might be a bad
> Windows habit, but it's not required by
At least with 16-bit Unicode, little- and big-endian byte order is marked.
I don't know the exact sequence; it's a 16-bit (2-byte) scheme that
differs between Windows and POSIX.
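For reference, the marker in question is the byte order mark, U+FEFF. Written big-endian it appears as the bytes FE FF, written little-endian as FF FE, so the difference follows the byte order the writer chose rather than the operating system as such. Its UTF-8 form is EF BB BF. A hedged sketch of a detector, with invented names, which deliberately ignores UTF-32:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF8 } bom_kind;

/* Hypothetical helper: inspect the first bytes of a buffer and report
 * which byte order mark, if any, it starts with. UTF-32 marks are not
 * handled in this sketch. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

A file without any mark is reported as BOM_NONE, which is the normal case for UTF-8 text on Unix-like systems.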
> That's just nonsense. UTF-8 is not hard
> to program for. And if you don't want to
> Trust me, it's going to become the
> standard on GNU/Linux systems in the
> near future.
> And you can't just declare ASCII
> obsolete. It's just fine for English,
That's all?
> and face it: Any serious programming has
> to be done in English nowadays (at least
That's not a serious argument. My article is about
global thinking, and all you talk about is English
in every direction.
> Open Source programming), and I doubt
> this will change in the foreseeable
> future.
That's exactly the point. If we are not willing
to leave ASCII behind us, there will never be
a clear encoding scheme, and my point was that,
when character sizes differ from one language
to another, there will be more difficulties than
with a standardized 32-bit character width.
And, if you have an array containing characters
(a string), then every data unit is of the same
size - or not? So you have to convert it internally
to the same bit width anyway to work with it...
> Maybe you should have a look at GTK+ and
> Pango. GTK+ 2.0 uses UTF-8 for all
> text now.
Maybe we use UTF-8 in 2 years everywhere.
And after 2 more years, we will again ask ourselves
why there are different sizes for characters, and
then we come up with this "historical" stuff once
again and again and again...
And - who wants to use a library to write simple
strings to a file? We put so much energy in these sophisticated libraries
but have no power to overcome the past mistakes to avoid the whole
workaround.
What about the central parsing engine idea?
Re: It's not that bad, actually
by Daniel - Jun 15th 2002 07:31:04
> At least with 16bit Unicode, there
> little and big endian is marked. I don't
> know exactly the sequence, it's a 16bit
> (2 bytes) scheme that
> differs between Windows and Posix.
I'm talking about UTF-8 text files.
> > And you can't just declare ASCII
> > obsolete. It's just fine for English,
>
> That's all?
Yes, that's all.
> > and face it: Any serious programming has
> > to be done in English nowadays (at least
>
> That's not a serious argument. My article is about
> global thinking, and all you talk about is English
> in every direction.
I'm talking about source code, not the user interface. All pilots use
English to communicate with each other, for good reasons. The same is true
for programmers. You need a least common denominator.
> > Open Source programming), and I doubt
> > this will change in the foreseeable
> > future.
>
> That's exactly the point. If we are not willing
> to leave ASCII behind us, there will never be
> a clear encoding scheme, and my point was that,
> when character sizes differ from one language
> to another, there will be more difficulties than
> with a standardized 32-bit character width.
> And, if you have an array containing characters
> (a string), then every data unit is of the same
> size - or not? So you have to convert it internally
> to the same bit width anyway to work with it...
No. Quite the contrary: UTF-8 is meant for text, not single characters.
Perhaps you should make yourself familiar with UTF-8. See http://www.cl.cam.ac.uk/~mgk25/unicode.html.
> Maybe we use UTF-8 in 2 years everywhere.
> And after 2 more years, we will again ask ourselves
> why there are different sizes for characters, and
> then we come up with this "historical"
> stuff once again and again and again...
If we go for fixed-length characters, maybe we'll ask ourselves then why
we only have 32 bit...
> And - who wants to use a library to
> write simple strings to a file? We put so much energy
> in these sophisticated libraries but
> have no power to overcome the past
> mistakes to avoid the whole workaround.
You won't need a library to write simple strings to a file.
Re: It's not that bad, actually
by Nudge - Jun 15th 2002 08:01:14
> I'm talking about source code, not the
> user interface. All pilots use English
Why, is there a difference? Languages should
be available for the console/terminal, imho. So
one can program in non-ASCII. I don't see any
difficulties besides the recoding of given sources.
The second thing I want to point out is, that a GUI needs, for example,
the next char |