
 A Proposal for a True Internationalization
 by Mathias Hartwig, in Editorials - Sat, Jun 15th 2002 00:00 UTC

7bit character streams are the most secure against misinterpretation. When I send email, I leave everything else out (though we Germans have other symbols, too), just in case there is a machine that cannot handle it. It has become a whole lifestyle; to write everything in lower-case letters and put a smiley on the end of the line seems to express global thinking. But if we are honest, we know it expresses nothing but a deficiency in modern character processing. This heritage of the 70s (the start of Unix system distribution) is a hard hurdle to overcome. Fortunately, the need for usability is getting stronger and the pride of programmers and administrators is getting weaker. As a student of the Japanese language, I went through many sleepless nights setting up user variables, input parsers, and terminal stuff, so I think I know the difficulties. With this article, I will try to present a proposal that comes from both sides of me, the programmer and the user.


Copyright notice: All reader-contributed material on freshmeat.net is the property and responsibility of its author; for reprint rights, please contact the author directly.

Today, it is quite common that people from different countries are working together. Also, though every employee may have his or her own terminal, it's likely that common applications are provided by a single file server. This creates a need for multi-language operating systems. It sounds critical, but it isn't (yet). The commands are exactly the same (Chinese sysadmins also type "mount", though I can imagine language-dependent symlinks) and answers may vary only a little (e.g., "y/n" in German is "j/n"). Until now, all other tasks of internationalization have been avoided by the system, and applications have to take up the slack. If they don't (if, for example, the administrator forgot to install the appropriate .po files), you can be lost on a terminal with pictograms that, to you, mean nothing.

The i18n movement which started some years ago solves a lot, but not everything. With it, only output is guaranteed to match the best gettext will find. What about the input? Multibyte strings, produced by input parsers like kinput2 or ami in an 8bit or 7bit environment, are hard to handle and crack easily (if you press the delete button, it removes only half a sign). kinput2 and ami cannot run together in one terminal, because code pages intersect. Start and end sequences are one solution, but a bad one and one especially not meant for the long run. Imagine a document full of different languages; if I want a function that gives a line length for this doc, it will be hell, and I haven't even mentioned what will happen when new languages with new start and end sequences are implemented.
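The two failure modes named above (a delete that removes half a sign, and a line length that counts bytes instead of signs) are easy to reproduce. The following is only a sketch: EUC-JP is used as a stand-in for a legacy multibyte encoding, and the sample text and the helper names `naive_length` and `naive_delete` are invented for illustration.

```python
# Illustration only: EUC-JP stands in for a legacy multibyte encoding.
text = "日本語 abc"              # 3 kanji, a space, 3 ASCII letters
raw = text.encode("euc_jp")     # what a byte-oriented program sees


def naive_length(data: bytes) -> int:
    """Line length as a byte-oriented tool would compute it."""
    return len(data)


def naive_delete(data: bytes) -> bytes:
    """Delete 'one character' by dropping the last byte."""
    return data[:-1]


char_count = len(text)           # number of signs the user typed
byte_count = naive_length(raw)   # 2 bytes per kanji, 1 per ASCII character

# Deleting one byte from a string of kanji cuts the last sign in half,
# so the remaining bytes no longer decode.
kanji_only = "日本語".encode("euc_jp")
broken = naive_delete(kanji_only)
try:
    broken.decode("euc_jp")
    half_sign_survived = True
except UnicodeDecodeError:
    half_sign_survived = False
```

Here the user typed 7 signs while the byte-oriented length reports 10, and the truncated buffer can no longer be decoded at all.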

Also, we have so many applications which handle text and formatting. Integration of multiple language parsers into them may take 5 times more than implementing the problem-specific algorithms. I think something like Microsoft's IME, a central (system-wide) solution, is needed here. Unfortunately, IME is not Open Source, and is therefore un(sup)portable.

Next problem: Character encoding. Oops, this discussion is as old as computers are. Every nation had its own coding scheme, using the same domains! What a crappy idea! How could somebody let this happen?! OK, you say we have Unicode. Unicode was a good idea, until they found that 16bit is too little. Also, look at Yudit's encoding list; there's not one single Unicode, but many: UTF-7, UTF-8, UTF-16, etc. Furthermore, Unicode text files may have a starting sequence (the byte order mark), and Windows saves Unicode with low-hi byte order, but POSIX systems don't. Java uses wide characters (16bit) internally. Wow! Now it means nothing. 16bit is just too little; it was only for the short run.
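To make the "many Unicodes" complaint concrete, here is a small sketch that encodes the same text several ways. It uses only Python's built-in codecs; the sample strings are arbitrary choices for illustration, not taken from any real file.

```python
import codecs

sample = "ü日"   # one Latin letter with umlaut, one kanji

# The same two signs occupy a different number of bytes in each encoding.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-be"]
sizes = {name: len(sample.encode(name)) for name in encodings}

# Byte order: the same sign, stored low-hi versus hi-low.
low_hi = "A".encode("utf-16-le")
hi_low = "A".encode("utf-16-be")

# The generic "utf-16" codec writes a starting sequence (byte order mark)
# so that a reader can tell which order follows.
with_bom = "A".encode("utf-16")
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A sign outside the first 65536 code points does not fit into 16 bits:
# UTF-16 has to spend two 16-bit units on it.
beyond_16bit = "\U0002000B"
utf16_bytes_needed = len(beyond_16bit.encode("utf-16-be"))
```

The last value is the practical meaning of "16bit is too little": one sign, two 16-bit units.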

Next problem: The console. Its fonts and behavior differ from those of X (which, ten years after the invention of TrueType fonts, still lacks correct handling; take a look at Abiword, and you will understand). If I were Chinese, I would want to also see Chinese on my console, but this is even harder than under X, not to mention input routines. But what's the difference for an input parser between X and the console?

Next problem: Somebody better stop me from complaining. We have to move on. We still use the old stuff, but are now saving in XML. This is not very revolutionary. I will try to take a step forward. I'd like to present a solution. It's time to think about an all-inclusive, simple, and working system design.

But first, again, a collection of the problems mentioned above:

  1. Machines are 7bit- or 8bit-oriented, and the input is, too. This is historical, but we have to overcome the compatibility paranoia, or there will be no progress.
  2. Code pages intersect, and we have to improve the Unicode scheme.
  3. We want input parsers on the operating system level, at a central place, which serve both X and the console. Also, a basic font that holds all characters for both X and the console is wanted (I don't like those question marks).
  4. We want neither start nor end sequences.
  5. We want user realtime language switching for input (and maybe for output, too).

Fortunately, there are now these advantages which we can use:

  1. We process addresses and integers as 32bit, we have 32bit buses, and next generation computers will have full 64bit architectures.
  2. Applications do not contact keyboard codes; they already get the values through the system (nearly unprocessed, but at least a keymap is used).
  3. Computers are quicker than ever, giving us enough time to parse the input correctly.
  4. So many routines for combining characters have already been written.
  5. We have enough memory for the BFF (Big Fucking Font).
  6. We have enough hard drive space for the text files.

Especially when I think about points 5 and 6 of the advantages, I say: Why bother? Let's give it a try. What I propose is first a new char type that is 32 bits wide. This will give us the security that in the future, no characters of any language will be outlawed. The most low-level routines (that write to the buses) will have to be changed. Upper-level APIs may stay the same (as user programs do), as long as they do not play with overflow calculation (in an 8bit char, 255 + 2 = 1). And, for heaven's sake, I propose to use only 7 bits of each of the 4 bytes. Still, we would have around 270 million signs available (2^28 = 268,435,456). You might say "That's way too much; 3 bytes, like for my display, is enough!" Well, there are sound cards that process 24bit, but the processor has to pack it into 32bit packages to enhance speed, so in the end, there's no real advantage to 24bit. Also, in another 100 years, there might be the need for more. Please throw away the idea that you will see 4 bytes when you open a terminal! A char (you could call it sign or foo or bar if you like) will be an atomic piece of data. This view also fits in the modern multimedia processing arena, where sound data consists of 2-byte 2-channel or, for studio work, 24/32-bit multi-channel data structures. Binaries consist of 32bit words, as do video streams, the most complex data we know today. The text file just hindered us from taking this revolutionary step.
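The numbers in this paragraph can be checked in a few lines. This is only a sketch of the arithmetic; the constant names and the helper `is_seven_bit_clean` are made up for illustration and are not part of any existing API.

```python
BITS_USED_PER_BYTE = 7
BYTES_PER_CHAR = 4

# 7 usable bits in each of 4 bytes gives 28 bits of code space.
available_signs = 2 ** (BITS_USED_PER_BYTE * BYTES_PER_CHAR)

# The overflow that upper-level code must not rely on: an 8bit char wraps.
wrapped = (255 + 2) % 256


def is_seven_bit_clean(value: int) -> bool:
    """True if the high bit of every byte in a 32bit value is clear."""
    return 0 <= value <= 0xFFFFFFFF and (value & 0x80808080) == 0
```

`available_signs` comes out at 268,435,456, which is the "around 270 million" above, and `is_seven_bit_clean` shows what "only 7bit of the 4 bytes" would mean for a single value.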

If you think this will blow up your filesystem, you are most likely wrong. Take the sizes of your text files, multiply them by 4 (or 2 if you are using CJK encoding text files), and compare them with your wav, MP3, or DivX files' sizes. Those files will not get bigger. The 7bit style is for old Internet routing hardware, but I think that in another 100 years, it won't still be there. Then, these domains may also be used. The encoding scheme is clear: a char is saved and loaded exactly as the number it is. No conversion. Hi-low byte order is preferred and seems more logical. We could use what Unicode did for 16bit, seamlessly integrating domains, allowing enough room between language domains. Unlike UTF-8, we don't want different sizes for Western and Eastern characters; that makes programmers unhappy and software difficult to control. Also, UTF-8 emphasizes historical Western domination of computing science, which is not very friendly. No start and end sequences -- that's it.
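A minimal sketch of what "save the number as it is, hi-low byte order, no conversion" could look like follows. It assumes Unicode code points as a stand-in for the numbering of signs, since the proposal does not fix one, and the function names `save_chars`, `load_chars` and `char_length` are hypothetical.

```python
import struct


def save_chars(text: str) -> bytes:
    """Write every sign as one 32bit number in hi-low (big-endian) order."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def load_chars(data: bytes) -> str:
    """Read the numbers back; no start or end sequences are involved."""
    count = len(data) // 4
    return "".join(chr(value) for value in struct.unpack(f">{count}I", data))


def char_length(data: bytes) -> int:
    """Line length becomes a plain division, whatever the language mix."""
    return len(data) // 4


mixed = "Grüße 日本語"
stored = save_chars(mixed)
round_trip_ok = load_chars(stored) == mixed

# Plain ASCII text grows by exactly the factor 4 discussed above.
growth = len(save_chars("abc")) // len("abc".encode("ascii"))
```

With fixed-width signs, the line length function that was so painful for mixed-language documents reduces to one division.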

Let's go on. There will still be a mapping between keyboard-sent codes and the 32bit chars attached to them, as a phase of preprocessing. The next step will be checking the user's input choice and sending the data to the parser. This parser will build buffers for input, syllable buffers, chosen readings, etc. The buffers will belong to the system. They will be cleared when switching between parsers, but we need the ability to foresee what we type. Under X, window managers may have a small buffer dock app in which you could see the language symbol of what you're typing. Take a look at IME, and you'll know what I mean. It might be more difficult in the console, but with libs like ncurses, there might be a way to give a better view on the writing.

Also, shells might stop character echoing and write buffer contents instead, then clear back to the last breakpoint, write the new sign, change buffers, and write them again after the new sign. I did this once with a small Japanese console learning app, and it's fast enough that you don't see it. When pressing the enter key, I propose that we use only ready-typed input; otherwise, some shells might want to do so and some do otherwise, not giving a standard behavior.

Now I will think about one of the most-feared things in the computer world: The change of data structures and backward compatibility. First, a single machine holds its data -- text files, binaries, etc. It is most likely connected to other machines or the Internet, and that's where I begin. These new generation operating systems that fully process 32bit data from hard disk to whatever will be compiled and booted on machines, but will get data through FTP or other services that will get tons of chars of the old type. Their buffer (char[]) will be a 32bit[] array, provided by the system (because sizeof(char) returns 4!). The service now writes it to disk, but, fortunately, the system does it for us, because the system always has a fear that some applications might damage the hardware. For the newly-installed machine, there is no difference in behavior except the file sizes for texts. When the new machine provides services, there will be the problem of sending too much information. If the client wants the next byte (or char), there might be a problem if it gets a value above 255 (or 127 signed) and dumps an error message or disconnects.
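The two directions of this compatibility problem can be sketched as follows. This is a toy model under stated assumptions: a Python list of integers plays the role of the system's 32bit char buffer, and `widen_legacy` and `narrow_for_legacy` are hypothetical names, not real system calls.

```python
def widen_legacy(data: bytes) -> list[int]:
    """Receive old-style 8bit chars into a buffer of wide chars.

    Each incoming byte simply becomes one wide value; nothing is lost.
    """
    return list(data)


def narrow_for_legacy(chars: list[int], limit: int = 255) -> bytes:
    """Send wide chars to an old client that expects one byte per char.

    Raises ValueError when a value does not fit, which is the case
    where an old client would dump an error message or disconnect.
    """
    too_big = [c for c in chars if c > limit]
    if too_big:
        raise ValueError(f"{len(too_big)} char(s) exceed the limit of {limit}")
    return bytes(chars)


incoming = widen_legacy(b"ab\xfc")          # old data arrives without trouble
outgoing = narrow_for_legacy([97, 98])      # small values can still be sent

try:
    narrow_for_legacy([0x65E5])             # a value far above 255
    old_client_survives = True
except ValueError:
    old_client_survives = False
```

Receiving old data is the easy direction; the check in `narrow_for_legacy` marks the point where a new machine serving old clients has to decide what to do.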

In the end, there will be no progress without backward compatibility problems. Network connections are a big advantage here. We should try to use them and finally throw away our fear, because the gain will be a clear character processing solution that works world-wide, with no more hassles with encoding schemes and browser displaying problems, and a user-friendly, simple-to-use, speedy and secure multi-language interface. Also, encoding information can be left out, which cleans up email and XML files. The first big step will be to change low-level system routines, and for that I wish us all some more courage towards a change of thinking.


Author's bio:

Mathias Hartwig was born in Leipzig, Germany. He has studied Japanese and Computing Science, and a bit of Korean and French. He works as a programmer at a University Hospital. He started programming with GW-Basic and Turbo Pascal 5.0 when he was 12, and later used Delphi, Java, C++, and C, which he still finds to be the best. His first Linux distribution was a Slackware 3.0, and he remembers his first two hours looking at the console, typing commands like "DIR", with nothing happening. He now uses Mandrake Linux both for work and private use, and starts Windows only to recompile his projects with MingW's GCC. Most of his projects deal with language learning.



 Referenced categories

Topic :: System :: Console Fonts
Topic :: Text Processing :: Fonts

 Referenced projects

gettext - The GNU internationalization library.
JLearner - A Japanese character learning tool.
Minivoki - A vocabulary learning aid.

 Comments

[»] In the middle of secrecy…
by drg001 - Sep 13th 2002 01:34:13

In the middle of secrecy…
(Sometimes, a smile can do more …)

Warning:
You can read further only if you can keep this information top secret from everybody, including your friends, and, especially from nobody. Do not copy this document and, God forbid, do not distribute it through e-mail, regular mail, by word of mouth or psychic ability, even if you do not have one.

http://www.tupbiosystems.com/articles/secret.html



[»] Japanese and input methods
by Ambrose - Aug 30th 2002 00:18:03

The author should get an update on existing projects (software and otherwise): mlterm (mlterm.sourceforge.net) can already use more than 1 input method. You can change the input method on the fly, and it will do charset translation for you (so you can, e.g., use a Japanese input method to type Chinese words, or vice versa, if Unicode thinks they are the same). If you like to be confused, you can also change the charset on the fly too. And yes, there is such a thing as a Chinese console.

As for the future direction on the interoperability of input methods, work is already started on the implementation of IIIMF, the next-generation input method framework that takes even Microsoft Windows into account.

Of course, the reality now is that most programs cannot use more than 1 input method. If waiting for IIIMF is not realistic (and it is not realistic in the short term or the medium term), instead of lamenting the reality of the largeness of Unicode and the CJK charsets, we should try to hack on the problem and try to tackle making programs able to use more than 1 input method. Perhaps it would be possible to create "input method proxies" that can call other input methods and translate the input to the target locale; the existence of mlterm shows that it is possible for a program to call more than one input method, and I see no reason why that program cannot be itself an XIM server.

(In fact, because of mlterm, I am trying to look for XIM servers for other writing systems such as Eastern/Western European, Greek, Russian, etc. But I can't find them. Perhaps only the people who frequently have to use them know where these things are.)

We Chinese used to think that our language is not "scientific", and this caused our writing system to be despised and sabotaged by our own people. (I admire the Arabs and the Jews, that they can keep their writing systems r-to-l despite Western influence.) However, this is wrong because we can use input methods to enter Chinese, and the speed of Chinese touch typing is comparable to English touch typing. It is not productive to say a language or even a writing system is unscientific; even among alphabetic writing systems, the only truly scientific alphabet is Korean hangeul (if you don't believe this, dig through sci.lang archives).



[»] a whole lifestyle
by barrett9h - Aug 12th 2002 17:42:25

It has become a whole lifestyle; to write everything in lower-case letters and put a smiley on the end of the line seems to express global thinking.
indeed. =)

But if we are honest, we know it expresses nothing but a deficiency in modern character processing.
nope. i don't feel like that. it's really a differentiated form of comunication. for ex, i speak portuguese, and we have many accented characters. but i don't use any, not just because it's incompatible with some systems, but also because it's harder to type. (you see, i don't even use a capital 'i' for the 'i' word..) i just use some adaptation, when apropriated.. (for ex. im portuguese "e" and "é" are two different words, and i type them as "e" and "eh". everybody understands..) and don't care about being gramatically correct anyway. the important thing is to comunicate, and i type much like the way i talk..

i also find text mode smileys way nicer then the graphicals ones..



    [»] Re: a whole lifestyle
    by Ambrose - Aug 30th 2002 00:23:40


    > i also find text mode smileys way nicer then the graphicals ones..
    yes, text mode smileys are way nicer than graphical ones.
    The graphical smileys are *ugly*. And they interfere with the
    normal (Chinese/Japanese) smileys I use. I almost always
    turn graphical smileys off if a program "supports" them.



    [»] Re: a whole lifestyle
    by Hisham Muhammad - Sep 3rd 2002 17:07:57


    > i just use some adaptation, when
    > apropriated.. (for ex. im portuguese "e"
    > and "é" are two different words,
    > and i type them as "e" and "eh".
    > everybody understands..) and don't care
    > about being gramatically correct
    > anyway.

    Man, this is ugly. This is just like using
    "naum" instead of "não". It is, like somebody
    else said in this thread, a lack of self-respect
    towards one's own language.


    > the important thing is to comunicate,
    > and i type much like the way i talk..

    Yes, this is happening more and more. And it
    is very unfortunate. The quality of written text is
    decreasing. A friend of mine even wrote "aí eu
    peguei e fui lá"... :(



      [»] Re: a whole lifestyle
      by barrett9h - Sep 4th 2002 11:14:15

      That's true, but I only use this kind of writing when I'm "talking" to my friends on email, irc, icq.
      When writing a letter or something more important I like to write the "right" way, and I still know how to write correctly when I need to... :)
      The language is always evolving through history, and this (net talk) is just one more adaptation.



[»] You're probably right but I think->
by Butuque - Aug 9th 2002 00:00:14


I can post this with all the crap already on here...

> The i18n movement which started some years ago solves a lot, but not everything.
> With it, only output is guaranteed to match the best gettext will find. What about the input?
> Multibyte strings, produced by input parsers like kinput2 or ami in an 8bit or 7bit environment,
> are hard to handle and crack easily (if you press the delete button, it removes only half a sign).
> kinput2 and ami cannot run together in one terminal, because code pages intersect. Start and end
> sequences are one solution, but a bad one and one especially not meant for the long run.
>
> Imagine a document full of different languages; if I want a function that gives a line length for
> this doc, it will be hell,

I say that this is too bad. But if you do make one, make one that I can easily store like this.
Byte order be damned - everyone has to do the work.

>and I haven't even mentioned what will happen when new languages
> with new start and end sequences are implemented.
I don't see what will happen.

> Also, we have so many applications which handle text and formatting.
> Integration of multiple language parsers into them may take 5 times more than implementing
> the problem-specific algorithms. I think something like Microsoft's IME, a central (system-wide)
> solution, is needed here. Unfortunately, IME is not Open Source, and is therefore un(sup)portable.

Are we supposed to look at what is out there?
What does IME do?


I don't have the time to check this out, but:

Let's look at it like this:

I want to keep as much of my text file in memory at once as I can. I am writing a spell checker,
and it can't take up my whole machine. I want to be able to look up Western words and Eastern words.
They are in my database, and I want as much of that in memory as I can fit.

: I want to store as Multi Byte Characters (like MBCS, only 64 bit).
: Most of my data can be compressed, because I am a Pentium 4 on a programmer's desk and have
  CPU cycles to burn.
: I need to be able to traverse this with a pointer.
: I am going to traverse it only once anyway, so there is no need to expand it.
: My data is in records (equiv. to LineFeed, '\0' at end of string, etc.)


C++ will do the job, but I have to use functions on my pointer class.
I want to write this code but once, and be able to read this stuff forwards and backwards in any
programming language.


Sounds like a CORBA implementation of an interface called the characterPtr, one that everyone
will use, is in order.

Things I need:


I use things like / and \ and + and ~ that everyone with a computer likes to use.
These would be nice to fit in with 1 byte characters so I can say in my compression that:

the next 116 characters are going to be single byte characters.
These would also be nice to fit in with the 2 byte data of the next set of 115 characters.


My program wants to recognize this character, so I will call it:
1111 1111 1111 11XX (substitute the ascii code for '/' for XX) (8 bytes)

It can be stored as:
XX
11XX
XX11
1111XX
XX1111

And I already know how many bytes long it is, and whether it is in Motorola or Intel storage format
(my 'compression' told me this).

When I look at it using my characterPtr class, it always looks like 1111 1111 1111 11XX.

It is always kept stored compressed. I have to traverse it.

Problem: I don't use CORBA, or I can't because I program in bash.

Solution: write the equivalent of the characterPtr in your language to access the whole world
of the newly defined character set that is always, wherever it is stored or transmitted, compressed
using the same compression scheme:

If I have data that is 100 strings, each an average length of 16 characters, and all characters are
7 bit, each string takes up about 20 bytes encoded. The extra 4 characters tell me the length of
the string and that it is all single byte and stored in Intel format (doesn't matter for single byte,
but I know anyway).

Everyone in the world knows how to read those extra 4 characters, and so to them (us) this data
looks like sets of 64 bit characters.


Solution 2: Write a String class that uses the base of this pointer so that it can traverse
the compression backwards.


The computer can no longer define the language; we must do that part.

--
// // Butuque //



    [»] Re: You're probably right but I think->
    by Butuque - Aug 9th 2002 02:47:12

    Well, I wish I hadn't posted that. I still have to consider adding to the compressed data, etc. Let's just forget I posted at all, shall we? ;-) Butuque!

    --
    // // Butuque //



[»] please, jeff
by Robert Trebula - Jun 26th 2002 08:58:10

please, wipe out the whole comments board, ban access to fm-comments to this idiot and let's start the discussion again.
it's causing me a headache reading these outputs of some brain-dead child...
thanks...
Robert
PS: read it through and you'll soon find out who I am talking about



    [»] Re: please, jeff
    by jeff covey - Jun 26th 2002 10:14:58

    Trolls only stick around as long as they're fed.

    --
    vs lbh pna ernq guvf, lbh'er n trrx.



[»] What happened to rxvt?
by linuxknight - Jun 24th 2002 18:32:18

If you go to rxvt.org, you notice that their web site hasn't been
updated since 2000, and, they say the last release was 2.7.3.
But, if you go to ftp.rxvt.org, you see rxvt 2.7.8 was released in
2001. And, if you go to their cvs on sourceforge, there were
updates until 2 months ago. That's weird. There was no
indication on their web site that the project was dead, but at
first I thought: maybe the project died without notice. But, when
I saw the ftp site had been updated until 2001, that no longer
made sense. What happened to rxvt?

--
Signed, Linuxknight



    [»] Re: What happened to rxvt?
    by Ambrose - Aug 30th 2002 00:26:53

    The official web site for rxvt is at rxvt.org. Please look for updates there first.



[»] ASCII
by linuxknight - Jun 20th 2002 19:01:00

Come on, why don't you admit that ASCII has its
uses? I know we need unicode to allow people who
speak non-roman languages to speak in those
languages, but ASCII is good for the people who
don't speak those languages. I will admit ASCII
is not all we need, but I do think it has uses.
Like, if you use english for your documents, ASCII
contains the character you need. But it does
include german characters. I saw a web page with
german characters in lynx. I pasted the characters
into vi. It worked. All the chars were still
there. All of them. If german characters appear in
lynx, wouldn't that prove that german characters
were supported by ASCII? Not only that, but I
saved that document. Reopened it. All the german
chars were still there. If you make a new
standard, that's fine. Just make sure you keep
it compatible with text mode. So that people can
still use lynx.

--
Signed, Linuxknight



    [»] Re: ASCII
    by Daniel - Jun 20th 2002 19:54:50

    locale -k charmap

    I pretty much doubt the output is charmap="ANSI_X3.4-1968" (that would be pure ASCII).



      [»] Re: ASCII
      by linuxknight - Jun 20th 2002 22:41:20


      >
      > locale -k charmap
      >
      > I pretty much doubt the output is
      > charmap="ANSI_X3.4-1968" (that
      > would be pure ASCII).
      >
      >
      The output was: charmap="ANSI_X3.4-1968" The two german characters were: Wait a minute! When I typed out the cat command on the console to open the file that had the german chars, they displayed on the console, but when I tried to paste them into this message, it only showed up as two question marks! But, when I pasted it into another xterm running vi, the chars appeared. And, when I opened the file using Nedit, the chars appeared, and I was able to paste them into this message, but when I tried to paste it from Nedit to the Xterm, they appeared as two question marks! It seems that the german chars can only be pasted from one Xterm to another, or from one X application to another. But, the german chars showed up in the file and lynx outside X. What happened?

      --
      Signed, Linuxknight



        [»] Re: ASCII
        by Daniel - Jun 20th 2002 23:03:29


        > The output was:
        > charmap="ANSI_X3.4-1968"

        You're running in the C locale. It only worked because the characters weren't checked for correctness and sent directly to the console, which probably understands ISO 8859-1 by default.


        > But, the german chars showed up
        > in the file and lynx outside X. What happened?

        That's exactly the kind of problem we're trying to fix -- I don't know what happened. This whole charset nonsense is way too complicated. LC_ALL=en_US might help you, dunno.



          [»] Re: ASCII
          by linuxknight - Jun 20th 2002 23:36:56


          >
          >
          > % The output was:
          > % charmap="ANSI_X3.4-1968"
          >
          >
          > You're running in the C locale. It only
          > worked because the characters weren't
          > checked for correctness and sent
          > directly to the console, which probably
          > understands ISO 8859-1 by default.
          >
          >
          > % But, the german chars showed up
          > % in the file and lynx outside X. What
          > happened?
          >
          >
          > That's exactly the kind of problem we're
          > trying to fix -- I don't know what
          > happened. This whole charset nonsense is
          > way too complicated. LC_ALL=en_US might
          > help you, dunno.
          >
          >
          I'm sorry, I REALLY thought german characters were supported by ASCII. What is ISO-8859-1? Is it a subset of ASCII that supports german chars? But, I was sure ASCII supported german chars, I saw german chars on the console. So, the console can display chars outside ASCII? Well, I thought the console could display only chars in ASCII. My console can display chars from all european languages, but not from any non-roman languages. So, we don't need ASCII to keep console mode? Since consoles can't display any japanese chars, what should we do? Get rid of the console and only use X? But, you can't get into X without the console. That means we wouldn't be able to use any OS based on a console anymore. Does this mean linux is dead? Are we doomed to use Windows XP? If you include japanese chars in a new standard, the console is dead, and linux is dead. But, if you exclude non-european characters from a new standard, even if it were 8 bytes per char, linux & the console will still work. In other words, we need a new standard, but we need to keep it compatible with the existing consoles.

          --
          Signed, Linuxknight



            [»] Re: ASCII
            by Daniel - Jun 20th 2002 23:50:27


            > supported by ASCII. What is ISO-8859-1?
            > Is it a subset of ASCII that supports german chars?

            ASCII is a subset of ISO 8859-1, which is the most commonly used 8bit extension character set in Europe. But don't think it covers all European languages.


            > In other words, we need a new standard, but
            > we need to keep it compatible with the
            > existing consoles.

            Believe it or not, I'm running UTF-8 in my console:

            locale -k charmap
            charmap="UTF-8"



              [»] Re: ASCII
              by linuxknight - Jun 21st 2002 00:25:26


              > Believe it or not, I'm running UTF-8 in
              > my console:
              >
              > locale -k charmap
              > charmap="UTF-8"
              >
              >
              UTF-8 is a good idea! But, how did you change your
              console to use UTF-8?

              --
              Signed, Linuxknight



[»] ASCII
by linuxknight - Jun 19th 2002 01:54:26

Well, I've still got one point: ASCII is capable of supporting the roman alphabet. For example, I went to a german web site, found some german characters, pasted them into an xterm running vi, saved it, closed vi, and opened vi with the file. The german characters were still there, ü for example, was still there. So, if ASCII doesn't support that character, that would be impossible. So, ASCII must support that character, right? Even though your point about people who speak some language like arabic, japanese, etc. and a roman language may be true, there are a lot of people who speak only roman-based languages. So, people who only speak roman based languages should not have to make their files 4 times bigger for the others, should they? So, UCS-4, etc. already exist. So, if you plan to speak non-roman languages, use it. I dislike non-roman languages.

--
Signed, Linuxknight



    [»] Re: ASCII
    by Mirza Hadzic - Jun 19th 2002 05:21:19

    You were just lucky. ASCII chars are 0-127, and the "u umlaut" you mentioned is number 129 in many codepages. So, it happens that in several codepages it is ü, but in some countries it can be anything else, depending on the codepage used. The codepage is the way these chars (128-255) are interpreted. In the Czech Republic we have many codepages, among others the old "Soviet" style KOI codepage, Kamenicky (also obsolete), Latin-2, ISO, Windows-1250 to name a few :-). So we even have czech -> czech text file conversion programs.
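The point about number 129 can be checked directly. The sketch below decodes that one byte with a few of the codecs that ship with Python; the choice of codepages is only an illustration and is not meant to match what any particular console uses.

```python
byte_129 = bytes([129])   # 0x81, outside the ASCII range 0-127

# Two DOS codepages agree that 129 is the u umlaut...
as_cp437 = byte_129.decode("cp437")
as_cp852 = byte_129.decode("cp852")

# ...while KOI8-R reads the very same byte as a box drawing sign.
as_koi8r = byte_129.decode("koi8_r")

# Pure ASCII has no meaning for it at all.
try:
    byte_129.decode("ascii")
    ascii_accepts_it = True
except UnicodeDecodeError:
    ascii_accepts_it = False

# In ISO 8859-1 the u umlaut sits at a different number altogether.
latin1_number = "ü".encode("latin-1")[0]
```

The same byte gives ü in two codepages and a vertical line in a third, which is why a file that looks fine on one machine can show something else on another.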



      [»] Re: ASCII
      by linuxknight - Jun 19th 2002 16:23:50


      > You was just lucky enough. ASCII chars
      > are 0-127 and "Umlaute u" you mentioned
      > is naumber 129 in many codepages. So, it
      > happens that in several codepages it is
      > u but in some countries it can be
      > anything else depending of codepage
      > used. Codepage is the way how these
      > chars (128-255) are interpreted. In
      > Czech Republic we have many codepages,
      > among others old "Soviet" style KOI
      > codepage, Kamenicky (also obsolete),
      > Latin-2, ISO, Windows-1250 to name a few
      > :-). So we even have czech -> czech text
      > file conversion programs.

      Is Czech a European language? Well, anyway, I included
      support for every codepage in my kernel. Isn't there a
      codepage that supports all of the western alphabet?
      I think we have space in ASCII to add all of the extra
      characters in European languages. And why would ASCII be
      only 127 chars? Those extra chars in German, French, Spanish, etc.
      that were not included in normal ASCII could be added to the
      128-255 gap. I think they would fit. Most of the chars used in
      European languages are already in ASCII, so all we have to
      do is add the extra chars from European languages. That
      would be possible.
      P.S. Did you know that the German word for cat is Katze?

      --
      Signed, Linuxknight



        [»] Re: ASCII
        by Mirza Hadzic - Jun 19th 2002 16:57:13

        > Is Czech a european language?

        ???

        > codepage that supports all of the
        > western alphabet?

        It is hard to say what a "Western" alphabet is. The Czech alphabet is as western as English, and so are Swedish, Hungarian, Polish, French, Turkish... even Bulgarian or Greek, which have a totally different layout of letters. So it is much more than 256 chars, and that's a problem. You cannot skip any language, because there are institutions like the EU which consider all these languages equal and want to use several languages inside a single document.



          [»] Re: ASCII
          by Axxackall - Aug 4th 2002 21:24:45

          Don't divide the world like Hitler did. ASCII is for nazi-thinking people.

          I support UTF-8 - that's the future for a world where half of the people speak languages which are not compatible with ASCII.

          There is still another and much bigger problem yet to solve: timezones.

          By the way, does anyone know of anything like UTF-8 but targeting the chaos in timezones?



        [»] Re: ASCII
        by Ambrose - Aug 30th 2002 00:41:36

        If you read comp.fonts (through Google Groups I suppose, to dig out the old old articles), you may realize that ASCII does not even support English. Why? Because there is no code point for the two kinds of dashes that are required by grammar, and no distinction between opening and closing quotation marks. Worse (contrary to what some people think and try to make other people think the same), some code points (e.g., apostrophe and grave accents) have valid alternative meanings (e.g., closing and opening single quotation marks). Some English words require accent marks. And the "Icelandic" thorn and eth letters used to be English letters a very long time ago.

        IMHO, the decreasing quality of English punctuation use is directly attributable to the spread of computers.

        The charset conversion problem is not Unix-specific. Users of Chinese or Japanese Windows / Macintosh see it all the time. English-speaking people were just not used to seeing this.



    [»] Re: ASCII
    by Miles - Jun 19th 2002 17:31:02


    > Even though your point about people who
    > speak some
    > language like arabic, japanese, etc. and
    > a roman language
    > may be true, but, there are a lot of
    > people who speak only
    > roman-based languages. So, people who
    > only speak roman
    > based languages should not have to make
    > their files 4 times
    > bigger for the others, should they? So,
    > UCF-4, etc. already
    > exist. So, if you plan to speak
    > non-roman languages, use it.
    > I dislike non-roman languages.

    A couple of things. First of all, the most spoken language is not English. It's Mandarin (Chinese). There is obviously a place for non-ASCII character encodings.

    Second, text files (as mentioned in the main article) are usually in the minority for space used on personal computer systems.

    Third, there is nothing stopping you from running something like gzip on the text files which will not only get rid of the multi-byte tax, but they'll end up smaller than the equivalent ASCII file.

    Fourth, I don't think it's really an issue for most folks when 100GB hard drives are becoming normal; that's a whole hell of a lot of text, multi-byte or not.

    Fifth, it's UCS-4. Heh heh... nitpicking. ISO/IEC 10646 encoding form: Universal Character Set coded in 4 octets. UTF-8: Unicode (or UCS) Transformation Format, 8-bit encoding form.

    Sixth, this discussion is from a programmer perspective and not really a user perspective. When you say that you dislike non-roman languages, I assume that you have no real experience with non-roman languages. That's fine I guess, but are you willing to state that none of the users of the programs you write like non-roman languages? This is the real issue.

    Finally, if you use UTF-8 on your system, you will see no appreciable amount of wasted space over ASCII, but the potential to hold most other characters is still available. And as an added bonus, all of your standard ASCII documents are still valid. There is very little excuse to only support western characters when such an easy alternative is available.
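
    The size claims above are easy to try out. This is a rough Python sketch, assuming a made-up and highly repetitive sample text; real prose compresses less dramatically, so treat the numbers as an illustration and not as a benchmark.

```python
import gzip

# Hypothetical sample: plain English text, repeated so gzip has something to work with.
sample = "The quick brown fox jumps over the lazy dog. " * 200

as_ascii = sample.encode("ascii")
as_utf8 = sample.encode("utf-8")
as_ucs4 = sample.encode("utf-32-be")   # 4 bytes per character, no byte-order mark
ucs4_gz = gzip.compress(as_ucs4)

print("ASCII bytes:        ", len(as_ascii))
print("UTF-8 bytes:        ", len(as_utf8))
print("UCS-4 bytes:        ", len(as_ucs4))
print("UCS-4 + gzip bytes: ", len(ucs4_gz))
print("UTF-8 identical to ASCII:", as_utf8 == as_ascii)
```

    For pure ASCII input the UTF-8 bytes are identical to the ASCII bytes, the UCS-4 form is four times larger, and compression removes most of that padding again.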



      [»] Re: ASCII
      by linuxknight - Jun 20th 2002 22:57:26


      > issue for most folks with 100GB hard drives

      My computer doesn't support 100GB hard disks,
      I only have a 20GB hard drive. Am I supposed
      to pay $6000 for a computer with a 100GB hard
      drive?

      --
      Signed, Linuxknight



        [»] Re: ASCII
        by Miles - Jun 21st 2002 14:02:26

        Where do you live that a new computer costs $6000? A new motherboard here costs anywhere from $80-$200. A whole new (very fast) computer can be purchased for less than $1000.

        But you're right. A lot of people still have 20GB. That's only about 20 billion* characters of text (ASCII or UTF-8 in a western country) give or take a couple of billion for program data. Now let's move to UCS-4: still 5 billion (give or take) characters. Now let's compress that with something like gzip: getting closer to 150 billion characters.

        I fail to see your point.

        * U.S. billion -- thousand million in some other locales.



          [»] Re: ASCII
          by linuxknight - Jun 21st 2002 18:59:59


          > Where do you live that a new computer
          > costs $6000? A new motherboard here
          > costs anywhere from $80-$200. A whole
          > new (very fast) computer can be
          > purchased for less than $1000.
          > But you're right. A lot of people still
          > have 20GB. That's only about 20
          > billion* characters of text (ASCII or
          > UTF-8 in a western country) give or take
          > a couple of billion for program data.
          > Now let's move to UCS-4: still 5 billion
          > (give or take) characters. Now let's
          > compress that with something like gzip:
          > getting closer to 150 billion
          > characters.
          > I fail to see your point.
          > * U.S. billion -- thousand million in
          > some other locales.
          >

          I didn't mean 20GB wasn't enough for UTF-8. I've started
          using UTF-8 now. UTF-8 is a good idea, because all ASCII
          chars will still be ASCII, but other chars would be encoded as
          2-4 bytes. Good deal. But, you shouldn't assume everyone
          has 100GB hard drives. I only got a 20GB drive 1 year ago.
          Before that, I only had a 3GB drive. The only reason I got a
          20GB hard drive was because the 3GB one wore out. My old
          486DX still uses a 1GB hard drive. As for your question about
          where I live: Dallas, Texas, USA
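
          The "ASCII stays ASCII, everything else takes 2-4 bytes" behaviour can be seen directly. A small Python sketch; the four sample characters are arbitrary picks for illustration.

```python
# Arbitrary sample characters, one for each UTF-8 length.
samples = {
    "A": "Latin capital A (plain ASCII)",
    "\u00fc": "u with diaeresis",
    "\u65e5": "CJK ideograph 'sun/day'",
    "\U00010348": "Gothic letter hwair (beyond 16 bits)",
}

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {lengths[ch]} byte(s)")
```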

          --
          Signed, Linuxknight



        [»] Re: ASCII
        by MikeFM - Jul 20th 2002 01:51:59


        >
        > % issue for most folks with 100GB hard
        > drives
        >
        >
        > My computer does'nt support 100GB hard
        > disks,
        > I only have a 20GB hard drive. Am i
        > supposed
        > to pay $6000 for a computer with a 100GB
        > hard
        > drive?
        >

        My computers are mostly ancient P120s, and several of them have drives larger than 100 gigs. Most computers can support drives of these sizes if you upgrade their BIOS. In cases where they still have problems, the hard drive companies typically have a utility that runs before your OS and enables support for the large drives. I've yet to find a Pentium or newer computer that couldn't support any size drive I slapped into it.



      [»] Re: ASCII
      by Gene Montgomery - Jun 29th 2002 13:11:28


      > A couple of things. First of all, the
      > most spoken language is not English.
      > It's Mandarin (Chinese). There is
      > obviously a place for non-ASCII
      > character encodings.

      Mandarin Chinese may have the largest number of persons who speak it as their first language; however, it is debatable whether Chinese is the "most used" language in the world.

      See

      As a reasonably well-traveled individual, I never cease to be amazed at the number of countries I have visited where English (many times with an accent) is available to the locals - and that applies to the Middle East, South Asia, Japan, Korea, Viet Nam, the Philippines, Holland, Scandinavia, parts of Africa, and other places. And in none of these locations would English be classified as the "native" tongue. Yet in some, it is preferred to the local language(s) or dialect(s) because of its universality.

      In the computer software discipline, the employment of English (or a language with primitive elements derived from English/Romance words, such as Fortran, Pascal, Ada, C, C++, and so on) is pronounced. I know of no computer programming language wherein the base language (spoken or written) is Mandarin Chinese. Indeed, I believe that the Chinese ideographic language forms would prove difficult to use as the basic "alphabet" or symbology of a computer programming language.

      My experience is that even the Japanese, who type into word processors and personal computers, use an anglicization/romanization at the keyboard called "romaji" to enter the sounds of the Japanese language, which are then converted to Katakana/Hiragana, and/or Kanji, for output to the display. I have done it while in Japan, but it isn't easy for one whose Japanese is limited. However, it is second nature to PC-aware Japanese (and they are becoming intensely PC-aware).

      Although I have long since forgotten the details, I remember reading perhaps 40 years ago about an English person who invented a romanization system for the Chinese a long time ago, which sounded similar to the Japanese method of getting from Romaji to Kanji. IIRC, he did it to help little children learn sounds of the language, and enable them to bridge to the concepts in the ideograms. BTW, the Chinese ideograms, while extremely similar to the ones used in Japan and often similar in meaning, rarely have even remotely similar pronunciation. So, a Japanese person can probably get the gist of a written Chinese ideogram, but would have to know (one of) the Chinese dialects to be able to verbalize the ideogram.

      I guess the point of this is that even in the Orient, where the ideogram reigns supreme with some of the literati and illuminati, they are forced to revert to the primitive 26-character English alphabet to enter computer input. So, when you think about i18n, I suggest thinking about the input issues as well as the output (display) issues, and I suggest that there are basic communication issues to be addressed - not just between peoples, but also between the human and the machine. For example, in Japan, romaji is universally taught, along with English as a second language, and has been for a great many decades. It is necessary, since romaji is taken as a given.



        [»] Re: ASCII - WOOPS my ref. dropped out...
        by Gene Montgomery - Jun 29th 2002 13:20:45


        > See http://iteslj.org/Articles/Kitao-WhyTeach.html



        [»] Re: ASCII
        by Miles - Jun 29th 2002 13:58:36


        > Mandarin Chinese may have the largest
        > number of persons who speak it as their
        > first language, however, It is
        > debatable as to whether Chinese is the
        > "most used" language in the world.

        On the contrary, I do believe that it is most used. English, however, is most likely the most widely known and understood. English is by far the most widely used for commerce, but other than that, more often than not, people in different countries converse in their native language (they *use* a non-English language).

        But this is missing my point. For some curious reason, some folks have come to believe that I think that all English-based programming syntax (if/while/for) should be changed to Chinese pictographs. For the last time, hear this: I did not say this! I did not imply this. I implied that there is a sufficiently large number of people to warrant the processing and display of non-romanized text.

        I am aware of the use of romaji for input purposes in Japan. I am also aware that many of those uses of romaji also include a kanji/hiragana/katakana translation step. In other words, after the romaji is input, a list of applicable pictograph substitutes is presented for selection. This is not every case, but it is a very common case in my experience. To put it into context, it would be the equivalent of an English speaker always writing 'to' in their writings whenever 'to', 'two', or 'too' was intended. Sure the reader could figure it out, and they all sound the same, but wouldn't it be better in many cases to take the time and select the correct one? "I have to presents to give to my to brothers to." See what I mean?
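
        The romaji step followed by a candidate list can be sketched as a toy in Python. Both tables below are hypothetical and tiny; a real input method uses far larger dictionaries and much smarter segmentation, so this only shows the shape of the two steps.

```python
# Hypothetical, deliberately tiny tables for illustration.
ROMAJI_TO_KANA = {"ha": "は", "shi": "し", "ka": "か", "mi": "み", "a": "あ", "me": "め"}
CANDIDATES = {
    "はし": ["橋", "箸", "端"],   # bridge, chopsticks, edge
    "かみ": ["紙", "神", "髪"],   # paper, god, hair
    "あめ": ["雨", "飴"],         # rain, candy
}

def romaji_to_kana(romaji):
    """Greedy longest-match conversion of romaji to hiragana."""
    out, i = [], 0
    while i < len(romaji):
        for size in (3, 2, 1):
            chunk = romaji[i:i + size]
            if chunk in ROMAJI_TO_KANA:
                out.append(ROMAJI_TO_KANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"cannot convert {romaji[i:]!r}")
    return "".join(out)

def candidates(romaji):
    """Return the kana reading followed by the kanji the user could pick."""
    kana = romaji_to_kana(romaji)
    return [kana] + CANDIDATES.get(kana, [])

print(candidates("hashi"))
print(candidates("kami"))
```

        One typed reading maps to several written words, which is the "to/two/too" situation described above: the selection step is where the writer says which one was meant.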

        A keyboard does not have to be pictograph-based in order for a computer to handle pictograph data. There are also advances in handwriting technology. Right now, it is not uncommon for reporters in Japan to hand-write their notes and transcribe them later instead of using a laptop and romaji translation because the laptop is slower for them. Handwriting recognition would remove this barrier. Of course, it would require that an i18n capable OS and editor is available -- hence the point of this discussion.

        Note: the point of this discussion is not that we should scrap all keyboards in current use either.

        And yes, I am aware that while the characters of China and Japan are very similar, their speech, pronunciation, and cultures are very distinct. In fact, other posts of mine in this discussion have pointed out this fact; however, as not all computers have text-to-speech engines and most are accessed visually from a screen, the importance of being able to display and input the characters is still much more relevant.

        Yes, East Asia uses the romanized alphabet extensively. Any visit to East Asia, however, will demonstrate very quickly that it is not used to the exclusion of pictographs. There is demand out there to handle both.



[»] Two steps approach
by Mirza Hadzic - Jun 18th 2002 09:40:53

1. Codepages *SUCKS*
2. Moving all Linux code to UTF-8 as a first step of getting rid of codepages is good. I expect all important Linux code will be UTF-8 compatible in two years at most.
3. After we have everything translated to UTF-8, we can move to UCS-4 in no time. Moving there from the point where we are now would be much more complicated because of codepages. This step (UCS-4) will make string operations both easier to write and faster, while only a UTF-8 -> UCS-4 conversion will be needed in the system, assuming all text will be UTF-8 already.
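
The UTF-8 -> UCS-4 step in point 3 is a single, codepage-free conversion. A minimal Python sketch, assuming Python's "utf-32-be" codec as a stand-in for UCS-4 and a sample string chosen only for illustration:

```python
# Sample text is an arbitrary choice; any valid UTF-8 input works the same way.
utf8_bytes = "Dvořák".encode("utf-8")

text = utf8_bytes.decode("utf-8")          # no codepage guess needed
ucs4_bytes = text.encode("utf-32-be")      # fixed 4 bytes per code point, no BOM

# Going back is just as mechanical, so UTF-8 can stay the storage format.
round_trip = ucs4_bytes.decode("utf-32-be").encode("utf-8")

print("UTF-8 length:", len(utf8_bytes), "bytes")
print("UCS-4 length:", len(ucs4_bytes), "bytes")
print("round trip ok:", round_trip == utf8_bytes)
```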



    [»] Re: Two steps approach
    by Daniel - Jun 18th 2002 17:29:57


    > 3. After we have everything translated
    > to UTF-8, we can move to UCS-4 at no
    > time. Moving there from point where we
    > are now would be much more complicated
    > becouse of codepages. This step (UCS-4)
    > will make string operations both easier
    > to write and faster while Only UTF-8
    > -> UCS-4 conversion will be needed in
    > the system assuming all text will be
    > UTF-8 already.

    Yes, this is the way to go. And I suspect a lot of apps won't even need to convert to UCS-4 later. It should definitely be a per-app decision to do so -- if you just want to display strings, as most GUI apps do, there is no need to deal with character-conversion at all.



    [»] Re: Two steps approach
    by linuxknight - Jun 19th 2002 16:36:08


    > 1. Codepages *SUCKS*
    That's not valid english. It's not "Codepages sucks",
    It would be "Codepages suck", and they don't.

    --
    Signed, Linuxknight



      [»] Re: Two steps approach
      by Daniel - Jun 19th 2002 17:32:56

      OK, let's play the game...


      > > 1. Codepages *SUCKS*
      >
      > That's not valid english. It's not
      > "Codepages sucks",

      That's not valid English either. a) You should use a capital letter, and b) a sentence ends with a period, not a comma.


      > It would be "Codepages suck",
      > and they don't.

      They do suck, as well as you. Now go home to mom and let us do something more productive than replying to this ridiculous shit you're typing.

      (Yeah, I know, I shouldn't feed the trolls. I just couldn't bear it, sorry everyone.)



[»] Your title, "True Internationalization" holds the key to the requirements.
by Curtis Veit - Jun 15th 2002 18:13:09

Good and very relevant topic. (Stepping up on soapbox)

If it is true internationalization you are after then the only good existing answer is UTF-8, which allows multiple languages in the same document without context sensitivity for characters based on position in the document. If you really require 4 byte representation within your program you can always convert to UCS-4 while you do your private magic.

Please read up on all the existing choices. If you do, I suspect that you will see the many advantages of UTF-8.

This also provides the advantage of being usable for multiple languages on machines that were designed for 8 bit ASCII characters without even requiring Unicode conversion routines (as long as you only use ASCII and UTF-8). This is absolutely brilliant for embedded devices where space is still an issue.
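
As a rough illustration of that last point, here is a Python sketch of byte-oriented code that knows nothing about Unicode and still handles UTF-8 data correctly. The record format (`key=value;key=value`) and the sample values are invented for this example.

```python
# Invented record format and sample values, encoded as UTF-8.
line = "name=Müller;city=Plzeň;note=日本語".encode("utf-8")

# Pure byte-level parsing: split on ASCII ';' and '=' only.
fields = dict(part.split(b"=", 1) for part in line.split(b";"))

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so none of them
# can ever be mistaken for an ASCII delimiter such as ';' or '='.
high_bytes = [b for b in line if b >= 0x80]

for key, value in fields.items():
    print(key.decode("ascii"), "->", value.decode("utf-8"))
print("bytes outside ASCII:", len(high_bytes))
```

The parser never decodes anything; decoding happens only at the end, for display.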

Unfortunately 32 bit chars do require conversion libs and will be context sensitive because you cannot fit all the possible characters for all languages into a single 32 bit code space. - Or perhaps you are proposing that some people's languages are not important enough to include?
(This requires special codes to switch to a new code space. The resulting context problems are a much more severe programming problem than different byte lengths for various characters.) Also you can program your Open (or closed) source programs in utf-8 today with more than one language used in the source and it works fine, (even on my ancient systems).

You are exactly right about needing full international support in computers today. From what I have seen the people doing real work in this area go to UTF-8. You can probably tell that it is my choice as well.

(A question for Unicode wizards: why is there a common practice to be converting utf-8 to ucs-4 for storage? To me utf-8 seems to be ideal for both storage and use within programs.)

Regards,

Curtis



    [»] Re: Your title, "True Internationalization" holds the key to the requirements.
    by David Roundy - Jun 16th 2002 07:52:42


    >
    > Unfortunatly 32 bit chars do require
    > conversion libs and will be context
    > sensative because you cannot fit all the
    > possible characters for all languages
    > into a single 32 bit code space. -
    > Or perhaps you are proposing that some
    > people's languages are not important
    > enough to include?
    > (This requires special codes to switch
    > to a new code space. The resulting
    > context problems are a much more severe
    > programming problem than different byte
    > lengths for various characters.)
    >

    How many languages do you think there are? 32 bits would support over 4 million languages with 1000 characters each (2^32 is about 4.29 billion code points)! Admittedly, there are some languages with more than 1000 characters, but it's also true that many languages share a character set.



      [»] Re: Your title, "True Internationalization" holds the key to the requirements.
      by piman - Jun 18th 2002 16:04:07


      > many languages share a character set.

      Yeah, and as soon as we try to combine them we get some of the existing CJK Unicode problems. The only Right Way to do it is delineate by language, not by typography.



        [»] Re: Your title, "True Internationalization" holds the key to the requirements.
        by David Starner - Jun 25th 2002 01:05:17


        >
        > Yeah, and as soon as we try to combine
        > them we get some of the existing CJK
        > Unicode problems. The only Right Way to
        > do it is delineate by language, not by
        > typography.

        Let me guess; your native language is either C, J or K? There are two main problems with your solution. First, there are about 5,000 languages by some counts, and boundaries are very ill-defined in some cases; worse yet, computers have to handle historical texts, which add a whole new dimension to the problem.

        Secondly, while Chinese and Japanese may believe their character sets are entirely disjoint, Europeans usually perceive the Latin character set as one connected whole. "valet" comes out of my keyboard just fine, and few would argue that it needs to be stored or displayed differently if it's French or English. Likewise, saying that "Ulrich Drepper, Robert M&#00fc;ller and Ri&#010d;ard &#010c;epas worked on a project" (freshmeat won't let me include the actual characters) comes naturally. Note that the only mono-lingual European character sets are ISO646-*, which had only a few characters to work with. Once eight bit sets became common, all major European character sets covered multiple languages.



[»] Why is this taking the form of a religious debate?
by Miles - Jun 15th 2002 17:48:25

UTF-8 should be used for a text storage format. Why? Since the vast majority of documents in the world, when saved, will take up less space on the hard drive. Let's face it, as has been mentioned elsewhere, most of the text files out there are compatible with ASCII. This is not racism or imperialism. It's pragmatism.

String manipulation within programs (in-memory) is a whole different story, however. Here, a fixed-size character makes more sense in most cases. While the case for the program that simply reads data in and spits it out again verbatim is a fairly common one, the vast majority of real-world programs manipulate character strings during the course of processing.

To be more clear about this, getting the Nth character of a fixed-char-size string is a constant-time operation, O(1), and takes the same amount of time whether N is equal to 5 or 500. On the other hand, getting that same character in a variable-width-char string is a linear operation, O(n), and takes approximately one hundred times as long to get the 500th character versus the 5th character. The same holds true for substring operations. Character replacement gets a lot more tricky as well. If all characters are treated the same, what happens when someone tries to replace the Nth character (which for the sake of this exercise, we'll call 'c') with a Thai character? In a fixed-width character string, this works just as in an ASCII string. In a variable-width character string, a lot of extra processing and data movement is necessary or subsequent characters will get overwritten due to the size difference between the two characters.
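
The difference can be sketched in Python on raw byte buffers. The two helper functions are hypothetical names written for this example, and "utf-32-be" stands in for a fixed 4-byte character format.

```python
def nth_char_fixed(ucs4, n):
    """Fixed width: jump straight to byte offset 4*n."""
    return ucs4[4 * n:4 * n + 4].decode("utf-32-be")

def nth_char_utf8(data, n):
    """Variable width: walk from the start, counting character start bytes."""
    count, start = -1, None
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:            # not a continuation byte: a new character begins
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")

text = "a\u00fcb\u65e5c"                    # mixed 1-, 2- and 3-byte characters
fixed = [nth_char_fixed(text.encode("utf-32-be"), i) for i in range(len(text))]
scanned = [nth_char_utf8(text.encode("utf-8"), i) for i in range(len(text))]

# Replacing 'c' with a Thai character: same size in UCS-4, larger in UTF-8.
before, after = "abcde", "ab\u0e01de"       # U+0E01 THAI CHARACTER KO KAI
utf8_growth = len(after.encode("utf-8")) - len(before.encode("utf-8"))
ucs4_growth = len(after.encode("utf-32-be")) - len(before.encode("utf-32-be"))
print(fixed, scanned, utf8_growth, ucs4_growth)
```

The fixed-width lookup is one slice; the UTF-8 lookup has to visit every byte before the target, and the replacement shifts everything after it by two bytes.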

To ignore the existing base of data is a recipe for the existing base of programmers and writers to ignore you.

The first step is UTF-8 compatibility. This is a minor change to most programs. Without some tie to a universal character encoding, i18n is impractical for all intents and purposes.

The second step is to impress upon programmers the algorithmic efficiency advantages of using fixed-width characters in their programs instead of variable-width.

Finally, as time goes on and new programs are created -- especially with the advent of newer languages that encourage better i18n behaviour -- people may find it easiest to save their data in UCS-2 or UCS-4 because that is the serialized form of their in-memory data structure. Then and only then will you see the widespread changeover. Anything else is pissing into the wind.

Will it happen overnight? No. Will it happen in our lifetimes? Probably. Is it worth it to rip apart all of our existing infrastructure, effectively stop all new development, effectively halt all existing development, and recode everything right now with 4-byte characters? I certainly hope that you say 'no'. Otherwise you will be advocating the betterment of society by making it a horrible place to live.



    [»] Re: Why is this taking the form of a religious debate?
    by Srin Tuar - Jun 15th 2002 23:21:02


    > To be more clear about this, getting the
    > Nth character of a fixed-char-size
    > string is a constant-time operation (O1)
    > and takes the same amount of time
    > whether N is equal to 5 or 500. On the
    > other hand, getting that same character
    > in a variable-width-char string is a
    > linear operation (On) and takes
    > approximately one hundred times as long
    > to get the 500th character versus the
    > 5th character.
    > The second step is to impress upon
    > programmers the algorithmic efficiency
    > advantages of using fixed-width
    > characters in their programs instead of
    > variable-width.

    This is not necessarily true: with combining characters, bidirectional text, and other Unicode features you will still need to do the same amount of work with wide characters as you do with multibyte characters.
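
    The combining-character case is easy to demonstrate. A small Python sketch using the standard unicodedata module; the accented "e" is just an arbitrary example.

```python
import unicodedata

composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Even in a fixed 4-byte encoding, slot 0 of the decomposed form holds a
# bare "e", not the accented letter the reader sees on screen.
ucs4 = decomposed.encode("utf-32-be")
first_slot = ucs4[0:4].decode("utf-32-be")

same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print("code points:", len(composed), "vs", len(decomposed))
print("first 4-byte slot:", repr(first_slot))
print("equal after NFC normalization:", same_after_nfc)
```

    So "the Nth 4-byte unit" and "the Nth character the user perceives" are not the same thing, fixed width or not.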

    An additional benefit of UTF-8 internal use is byte-order independence, which bypasses a perennial problem faced when making code portable.
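
    A quick Python illustration of the byte-order point, using the euro sign as an arbitrary sample character:

```python
ch = "\u20ac"   # EURO SIGN, chosen arbitrarily

utf8 = ch.encode("utf-8")          # one byte sequence, whatever the CPU
ucs4_be = ch.encode("utf-32-be")   # big-endian 4-byte form
ucs4_le = ch.encode("utf-32-le")   # little-endian 4-byte form

print("UTF-8:   ", utf8.hex())
print("UCS-4 BE:", ucs4_be.hex())
print("UCS-4 LE:", ucs4_le.hex())
```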



      [»] Re: Why is this taking the form of a religious debate?
      by Miles - Jun 16th 2002 15:34:50

      Combining characters are not as much an issue if you enlarge your character size. Assuming a 32-bit character with one bit excluded for internal use (effectively a 31-bit character) you have 2.15 billion characters from which to choose. You will forgive me if I don't see an immediate and urgent limitation.

      "Planning for the future" is not really viable in that the future tends to draw out possibilities previously unseen. You have to work with what you've got. What happens if we have visitors from outer space and must incorporate their characters (and the other billion species')? No, not very likely. Not likely at all. But it would totally invalidate any "perfect" solution that we might come up with today. I imagine there are other possibilities not as remote as the "alien contact" mentioned above that would still put a monkey-wrench in a "perfect" character-encoding solution.

      As for bi-directional text (misleading when you consider that text in the world has more than two possible directions -- think up/down), the display order of the text is not necessarily the logical order. Just because the display of the text is right-to-left (for example) doesn't mean that the characters must be kept in memory in contrary order to left-to-right characters. It just means that the first character in a block is rendered on the right instead of the left. It does not have to be a BiDi text issue.

      You're right that byte order could be an issue on some platforms when serializing strings. Explicit serialization to UCS-2 or UCS-4 is needed. Point taken.



      [»] I almost forgot...
      by Miles - Jun 16th 2002 15:52:23

      Just in moving people away from ASCII and the assumption that characters should be treated as a single byte, you will have solved 90% of the problem. Honestly, who cares what universal encodings are out there as long as you recognize that other encodings exist. Once a programmer recognizes that encodings (besides ASCII) exist and are worth supporting, the simple/dumb routines for character input will start to fade into the background.

      Once the data's in memory, who cares what format it's in? The only ones who must interoperate with in-memory strings are the developers and the maintainers and they should be using whatever encoding best fits the task at hand. In some cases, it'll be UTF-8. In others, it may be UCS-4. In others, Shift-JIS may fit the bill. As long as there is a conversion routine to and from Unicode for whatever you are using in your program, you can get from any encoding in the world to any other.
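
      Using Unicode as the pivot looks roughly like this in Python. The `convert` helper is a hypothetical name for this sketch; the codec names are the ones Python ships with.

```python
def convert(data, source, target):
    """Re-encode bytes from one charset to another, with Unicode as the pivot."""
    return data.decode(source).encode(target)

sjis = "日本語".encode("shift_jis")            # pretend this arrived from a legacy system
utf8 = convert(sjis, "shift_jis", "utf-8")
eucjp = convert(utf8, "utf-8", "euc_jp")
back = convert(eucjp, "euc_jp", "shift_jis")

print("Shift-JIS:", sjis.hex())
print("UTF-8:    ", utf8.hex())
print("EUC-JP:   ", eucjp.hex())
print("round trip ok:", back == sjis)
```

      With N encodings this needs N converters to and from Unicode, not a converter for every pair.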

      But of course, the hard part so far is getting C coders to accept a non-ASCII world. This is not spite toward C for spite's sake. C (and its derivatives) are some of the last popular holdouts that (a) have little i18n and l10n support in the standard language and (b) make little distinction between the concept of a character/string and a byte-array that represents a character/string. If it were more of an abstraction -- putting in a different back-end and letting the compiler deal with these details -- the world would be a better place (most likely with fewer buffer overflow vulnerabilities as well).



    [»] Re: Why is this taking the form of a religious debate?
    by Shiro.k - Jun 16th 2002 19:32:49

    I do agree that fixed-length character string representation is more efficient than variable-length one. But I think it's not so bad as some claim.
    > To be more clear about this, getting the
    > Nth character of a fixed-char-size
    > string is a constant-time operation (O1)
    > and takes the same amount of time
    > whether N is equal to 5 or 500. On the
    > other hand, getting that same character
    > in a variable-width-char string is a
    > linear operation (On) and takes
    > approximately one hundred times as long
    > to get the 500th character versus the
    > 5th character. The same holds true for
    > substring operations.

    I wonder if this is really an issue. When I process text, what I mostly do is either scan the text sequentially or use some search operation (like substring match or regexp match). Indices are returned as the result of the search operation so that I can extract the matched region from the string, but they don't need to be character indices---any kind of pointer does the job, and it's possible to create a pointer object that accesses any part of the string in constant time.
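
    A short Python sketch of that idea, treating a byte offset as the "pointer". The sample sentence and search word are arbitrary choices for this example.

```python
sentence = "Žluťoučký kůň úpěl ďábelské ódy"
haystack = sentence.encode("utf-8")
needle = "kůň".encode("utf-8")

# The search returns a byte offset, which is all that slicing needs.
start = haystack.find(needle)
end = start + len(needle)
match = haystack[start:end].decode("utf-8")

# The character index is only computed here to show that it differs.
char_index = len(haystack[:start].decode("utf-8"))

print("byte offset:", start, "character index:", char_index)
print("matched:", match)
```

    The byte offset gives constant-time access to the match; the character index was never needed to extract it.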

    I hardly see the case that I have to apply a pre-determined character index that is not a result of search operation. Maybe my experience is limited.

    The searching operation for variable-length characters is slower than the one for fixed-length. That's an issue. But it's not as bad as O(N) versus O(1).
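
    To make the cost under discussion concrete, here is a minimal sketch in C of fetching the Nth character of a UTF-8 string by scanning. The name utf8_nth is made up for illustration, and the code assumes valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to the n-th code point (0-based) of the UTF-8 string s,
 * or NULL if s holds fewer than n+1 code points. The cost is linear in n.
 * Assumes s is valid, NUL-terminated UTF-8. */
const char *utf8_nth(const char *s, size_t n)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Continuation bytes look like 10xxxxxx; any other byte starts a code point. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (n == 0)
                return s;
            n--;
        }
    }
    return NULL;
}
```

    The scan is linear in N, but the pointer it returns (like one handed back by a search) can be kept and reused in constant time.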


    >Character replacement gets a lot more tricky as well.

    Yes, it is tricky. But again, how common is it to substitute characters in place? When I use programming languages that don't have automatic memory management, I tend to do in-place replacement as much as possible.
    But in many situations the size of the replacement string differs from the size of the original region, and I end up reallocating the whole string.
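
    The reallocating approach might be sketched like this in C. replace_range is a hypothetical helper, not a standard function; start and end are byte offsets, which is what a search would hand back anyway.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of s in which the bytes [start, end) are
 * replaced by rep. The replacement may be shorter or longer than the
 * region it replaces. Returns NULL if allocation fails; the caller frees. */
char *replace_range(const char *s, size_t start, size_t end, const char *rep)
{
    size_t len, rep_len, new_len;
    char *out;

    assert(s != NULL && rep != NULL);
    len = strlen(s);
    assert(start <= end && end <= len);

    rep_len = strlen(rep);
    new_len = len - (end - start) + rep_len;
    out = malloc(new_len + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, s, start);                                  /* prefix */
    memcpy(out + start, rep, rep_len);                      /* replacement */
    memcpy(out + start + rep_len, s + end, len - end + 1);  /* suffix and NUL */
    return out;
}
```

    Because the sizes are computed in bytes, the same code serves ASCII and UTF-8 alike, as long as the offsets fall on character boundaries.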

    If the language supports garbage collection, I even think that prohibiting in-place replacement is more beneficial,
    because it allows me to share substrings without worrying that the shared storage will be modified inadvertently.
    (I assume I use some kind of "string object" that holds a pointer to the storage of the actual string.)

    I don't have any concrete performance comparison between fixed- vs variable-length string representation.
    Excuse me if my idea is irrelevant.



      [»] Re: Why is this taking the form of a religious debate?
      by Miles - Jun 19th 2002 17:10:06


      > I wonder if this is really an issue.
      > When I process the text, what I do
      > mostly is either scan the text
      > sequentially, or use some search
      > operation (like substring match or
      > regexp match). Indices are returned as
      > the result of the search operation so
      > that I can extract the matched region
      > from the string, but they're not
      > necessary to be a character index---any
      > kind of pointer does the job, and it's
      > possible to create such pointer object
      > that access any part of string in
      > constant time.

      True enough. In this case, you are right (for C). I've been spending a lot of time in higher-level languages and forgot some of my C idioms.


      > > Character replacement gets a lot more tricky as well.
      >
      > *** edited for brevity ***
      >
      > But in many situations the size of replacement string
      > differs from the size of original region and I end up
      > reallocating whole string.

      And here is the crux of the matter. You are experienced. You probably remember to do this every time. However, you do not write the vast majority of software out there. I would venture a guess that most software is written by someone who is not as proficient as you are. There is nothing stopping a good coder from doing as you say. However, a lot of coders aren't that good and haven't yet learned good practices or, more to the point, good practices with regard to non-ASCII character encodings. Far more likely, even when someone knows better, programmers make stupid mistakes after a many-hour coding session. When all of your unit testing is with standard ASCII strings (for example), the compiler won't catch the error. There are far more programmers out there who have written code in C for fewer than two years than programmers who have written C for more than five years. While this is true for every language, many other languages have built-in abstractions for character strings that simply don't exist in standard C.


      > I don't have any concrete performance comparison
      > between fixed- vs variable-length string representation.
      > Excuse me if my idea is irrelevant.

      Not at all irrelevant. For most of the situations you describe, you are correct in that the speed difference would be negligible (in C). I was more concerned with maintainability -- especially in situations where you are not the maintainer. But for the most part, you were more correct than I with regard to the common cases.



        [»] Re: Why is this taking the form of a religious debate?
        by Bill Spitzak - Aug 6th 2002 02:51:24

        I recommend UTF-8 ONLY for precisely the reasons you say require wide characters: maintainability and testing.

        If there are two interfaces for "ASCII" and "Wide characters" then the typical programmer is only going to test the "ASCII" interface and there are going to be bugs when an i18n user tries it. However if there is only ONE interface, "UTF-8", then that interface is going to be tested!

        Also no "amateur programmer" is going to successfully replace any characters in any string. The function "replace chars n-m with these m-n other characters" just is not used by anybody. Check Visual Basic if you don't believe me, that function does not exist (the replacement is allowed to be a different size). Any amateur programmers coming from that background are not going to want to do anything that you cannot do in UTF-8.



      [»] Re: Why is this taking the form of a religious debate?
      by Bill Spitzak - Aug 6th 2002 02:47:02

      Absolutely agree. I don't understand why people think that indexing the Nth character needs to be fast. In any text processing I can think of, N must be calculated first by scanning all the characters before it. It is trivial to replace N by the byte count and continue with the algorithm as before. So I see no savings there. Also, all modern text processing thinks about "words", and these are variable-length. It makes no difference if the characters inside them are variable-length as long as it is easy to detect the word boundaries. I recommend, with NO EXCEPTIONS, that UTF-8 be used for every single interface in a system where text is passed. There should be no "ASCII" interface, and certainly there should be no "wide character" interface. I don't think any programs will have to store or manipulate text in any form other than UTF-8. A huge win with UTF-8 only is that it eliminates the need for multiple interfaces. strlen() and so on are unchanged, except that they are defined to return the number of bytes in the string.
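
      As a rough illustration of the word-boundary argument, here is a sketch in C that counts words by splitting on ASCII whitespace only. The names is_ascii_space and count_words are invented for this example. It relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can be mistaken for an ASCII space.

```c
#include <assert.h>
#include <stddef.h>

/* Only ASCII whitespace counts as a separator in this sketch. */
static int is_ascii_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Count whitespace-separated words in a NUL-terminated UTF-8 (or ASCII)
 * string. The loop works on bytes and never decodes a character. */
size_t count_words(const char *s)
{
    size_t words = 0;
    int in_word = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (is_ascii_space((unsigned char)*s)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}
```

      Scripts that do not separate words with spaces need more than this, of course; the sketch only shows that byte-wise code written for ASCII keeps working.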



    [»] Re: Why is this taking the form of a religious debate?
    by linuxknight - Jun 19th 2002 16:42:00


    > most of the text files out
    > there are compatible with ASCII. This is
    > not racism or imperialism. It's
    > pragmatism.
    I agree with that.

    --
    Signed, Linuxknight



[»] Two different points of view
by Alessandro Staltari - Jun 15th 2002 16:34:25

I think the character set issue is mostly related to documents and graphical user interfaces rather than coding and consoles.
For the latter, ASCII may be enough (does it make sense to translate shell commands and language keywords? MS Excel says yes, but I'm not sure it is so useful).
Low level access to the system can still be ASCII, while for higher level interfaces we can use i18n and libraries to handle it.



[»] Sorry, UTF-8 *is* the way to go
by Srin Tuar - Jun 15th 2002 15:56:06

Rather than rewriting virtually all existing source code, it makes infinitely more sense to go with UTF-8. In fact, this decision has already been made, and disparate operating systems from Windows to Linux (Linux in a big way) are slowly standardizing on it.


UTF-8 gives you compatibility with ASCII, access to the full 31-bit Unicode range (Unicode saves the one extra bit for error codes, sign bits, etc.; very smart!), an error-recoverable byte stream, stateless and computationally trivial conversion, very low overhead for most existing text, overwhelming compatibility with existing software (no code changes for most software!!), and relatively trivial string width computation.
See for yourself: "http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c"


Using UCS-4 would be a huge headache with few benefits. It would also introduce all new kinds of bugs, for example assuming that the number of UCS-4 chars equals the display width of the string (not true; see combining chars, zero-width glyphs, etc.).
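
A small sketch in C of that mismatch, under loud assumptions: toy_width and ucs4_measure are invented names, and the width table knows only the combining diacritical marks (U+0300 to U+036F) and the zero-width space (U+200B). A real wcwidth() such as the one linked above covers far more, including East Asian characters that occupy two columns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy width table (an assumption for illustration only): zero columns for
 * combining diacritical marks and the zero-width space, one otherwise. */
static int toy_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;
    if (cp == 0x200B)
        return 0;
    return 1;
}

/* Walk a 0-terminated UCS-4 string and report both the number of
 * characters and the number of display columns under the toy table. */
void ucs4_measure(const uint32_t *s, size_t *count, size_t *columns)
{
    assert(s != NULL && count != NULL && columns != NULL);
    *count = 0;
    *columns = 0;
    for (; *s != 0; s++) {
        (*count)++;
        *columns += (size_t)toy_width(*s);
    }
}
```

For the three-element string 'e', U+0301 (combining acute accent), 'x', this reports three characters but two columns, so even a fixed-width encoding does not make "length" and "width" the same thing.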




[»] 16 bits is enough
by Ben Crowell - Jun 15th 2002 12:35:06

It's really a myth that 16 bits isn't enough. People make statements about how there's some huge number of Chinese characters, too many for 16 bits. But the reality is that most of these characters are names that parents have had experts create for their children in order to be able to name them something special. Nobody can read these characters. (Heck, as an American, I could name my kid with some weird symbol, and then complain that it wasn't implemented on computers.) The typical Chinese person knows a relatively small number of characters, and even highly educated people know far fewer than 2^16.



    [»] Re: 16 bits is enough
    by Shiro.k - Jun 15th 2002 18:01:37

    You're almost right that you don't need more than 2^16 chars for daily life, but we Japanese have historical documents and classic literature. We can't change the names of people and places, living or dead, just to fit them into a 2^16 charset. Unless you suggest that we abandon our history, 2^16 is not enough. (If you mean surrogate pairs in UTF-16, yes, that's enough so far to represent those rare characters in a stream of 2^16 codes.)



    [»] Re: 16 bits is enough
    by cpchan - Jun 15th 2002 18:10:17


    > the reality is that most of these
    > characters are names that parents have
    > had experts create for their children in
    > order to be able to name them something
    > special.

    What have you been smoking? I want some! I both speak and write Chinese fluently and this is the first time I have heard of such nonsense.

    Charles



    [»] Re: 16 bits is enough
    by David Starner - Jun 25th 2002 01:10:35


    > It's really a myth that 16 bits isn't
    > enough. People make statements about how
    > there's some huge number of Chinese
    > characters, too many for 16 bits. [...]

    It's really a moot point. There's no competing 16 bit standard. To support Cantonese, Hong Kong Chinese and modern musical notations (all important things), one must support full 32-bit Unicode. (For Cantonese, I'm told one of the characters is the equivalent of the English -ing suffix, necessary for almost any written Cantonese.)



[»] What about typesetting?
by Gregor Mueckl - Jun 15th 2002 10:59:11

This article only covers the characters and fonts involved in outputting text written in different languages. That's a problem in multilanguage environments like the internet, but it's not the only one.

The output has to be displayed correctly as well. Western languages are written and read from left to right. Most text printing routines can only handle that way of outputting text. But it's not the only one. Arabic text goes from right to left (just the opposite of western languages). And if I remember correctly, some eastern languages are even written from top to bottom, that is, in columns rather than lines.

So you have to think about a way to handle output as well. A graphical system can print single characters correctly, but they need to be aligned correctly to form a text that makes esnes - sorry, I meant "sense".

I think that this becomes a serious problem if you build a system that is capable of printing all those characters at the same time. How do you typeset a text that contains passages in English, Arabic and Chinese?



    [»] Re: What about typesetting?
    by Christian Rose - Jun 15th 2002 13:04:39


    > The output has to be displayed correctly
    > as well. Western languages are written
    > and read from left to right. Most text
    > printing routines can only handle that
    > way of outputting text. But it's not the
    > only one. Arabic text goes from right to
    > left (just the opposite of western
    > languages). And if I remember correctly,
    > some eastern languages are even written
    > from top to bottom, that is, in columns
    > rather than lines.
    >
    > So you have to think about a way to
    > handle output as well. A graphical
    > system can print single characters
    > correctly, but they need to be aligned
    > correctly to form a text that makes
    > esnes - sorry, I meant
    > "sense".
    >
    > I think that this becomes a serious
    > problem if you build a system that is
    > capable of printing all those characters
    > at the same time. How do you typeset a
    > text that contains passages in English,
    > Arabic and Chinese?

    Umm, I think these problems have in large parts already been solved by systems like Pango etc. Are you familiar with Pango?



    [»] Re: What about typesetting?
    by Emil Perhinschi - Jul 3rd 2002 19:52:06

    Why not let all this internationalization be the burden of typesetting and word-processing programs? The ``source'' can be stored in ASCII without much trouble ... I did this in LaTeX with English, Romanian and Polytonic Greek in one document ... Then think of the costs of internationalization, the trouble of standardization and coping with programmers' hubris ...

    Emil

    P.S.: I have no claim to correct English in my reply



[»] 32-bit text format
by Deekoo L. - Jun 15th 2002 07:18:24

(Of course, this appears right as I'm uploading
a UTF-8-native version of Yeemp...)

Problems - first off, everything has to be
audited and recompiled - malloc(strlen(src)+1)
needs to become malloc(strlen(src)+sizeof(char))
everywhere it appears.

Second: Confining things to 7bit seems wasteful.
The only reason to use 7bit is to transfer data
cleanly over 7bit-only links. 7-bit protocols
will require that the commands be in 8-bit chars,
even if the data are in 32-bit chars. To deal
cleanly with many 7-bit protocols, you'll need to
avoid using a large number of control and ASCII
glyphs in the 32-bit chars. Worse, the glyphs
to avoid differ from protocol to protocol.
Embedded nulls and CRs or LFs will break almost
any 7-bit protocol; @ signs in the wrong place
will choke SMTP; . will confuse domain resolvers;
space will confuse webservers. The characters
remaining for your encoding (and that's just after
chopping the ones that I think'll cause problems)
will probably make Base64 look pleasant. Further,
parsing a glyph index composed of discontiguous
septets in a 32-bit word will be a nuisance to
any program which has to deal with them. If
you're changing the char size, it breaks enough
stuff as-is that there's no point in trying to
get it 7-bit-clean *too*.


However, I do want one single giant character set, whether
it's 16-bit, 32-bit, or something else. Having
to tag every bit of content with encodings is
annoying (when it's text files), infuriating
(when it comes to files with multiple chunks of
data in different encodings), unfeasible (how
am I supposed to indicate the encoding for
user@host.co.uk, where 'host' is in Devanagari
and 'user' in Hangul?), and unreliable (when every
web browser comes with a list of every encoding
that any other web browser ever claimed to support...).

IMO, display, character sets, and input are
things that should be semi-independent - I don't
want to put the BFF on my emergency boot
disk, and input methods that are great for one
language may be suboptimal or terrible for
another.



    [»] sizeof(char)
    by mikpos - Jun 15th 2002 10:45:49


    > Problems - first off, everything has to
    > be
    > audited and recompiled -
    > malloc(strlen(src)+1)
    > needs to become
    > malloc(strlen(src)+sizeof(char))
    > everywhere it appears.

    Do you mean malloc(strlen(src)+1) should become malloc((strlen(src)+1)*sizeof(char)) everywhere? In any case it doesn't matter since the C standard requires that sizeof(char) be 1. Any compiler which doesn't make sizeof(char) 1 is non-conforming. This makes sense since the sizeof operator returns the size of its operand in chars. How many chars are there in a char? Well 1 of course.

    Of course this does not preclude 32-bit chars. There are lots of implementations out there where CHAR_BIT is 32.

    Using 32-bit wchar_ts would make more sense than making 32-bit chars though.
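
    A short sketch in C of what that allocation looks like with wchar_t. The helper name wide_dup is invented here, and whether wchar_t is actually 32 bits wide depends on the platform.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* sizeof(char) is 1 by definition; this typedef fails to compile
 * (negative array size) on any implementation where that is not so. */
typedef char sizeof_char_is_one[sizeof(char) == 1 ? 1 : -1];

/* Duplicate a wide string. The element count includes the terminator,
 * and the byte count is elements times the size of one element.
 * Returns NULL if allocation fails; the caller frees the result. */
wchar_t *wide_dup(const wchar_t *src)
{
    size_t n;
    wchar_t *copy;

    assert(src != NULL);
    n = wcslen(src) + 1;
    copy = malloc(n * sizeof *copy);
    if (copy != NULL)
        wmemcpy(copy, src, n);
    return copy;
}
```

    Writing the size as n * sizeof *copy keeps the allocation correct whatever the element type is, which is the audit the original poster was worried about.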


    > However, I do want one single giant
    > character set, whether
    > it's 16-bit, 32-bit, or something else.

    I agree. We already have such a thing, though: UCS-2 and UCS-4. Strings of Unicode characters that are uniformly 16 bits or 32 bits (respectively) in size. I didn't see the author proposing anything that UCS-4 wouldn't fix.

    I'd rather use UTF-8 than UCS-4, though.



[»] It's not that bad, actually
by Daniel - Jun 15th 2002 06:40:54

What's wrong with UTF-8? It's an 8-bit multi-byte encoding, and therefore independent of endianness. It's absolutely compatible with plain 7-bit ASCII and easy to program for. Most of the code written for ASCII continues to work with UTF-8, e.g. substring search.
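
A minimal sketch in C of the substring search claim, using the standard strstr(). The wrapper name utf8_find is invented for this example, and both strings are assumed to be valid UTF-8. Since lead bytes and continuation bytes use disjoint bit patterns, a valid needle can only match at a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find needle in haystack with plain strstr() and return the byte offset
 * of the first match, or -1 if there is none. No UTF-8 decoding is done. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *hit;

    assert(haystack != NULL && needle != NULL);
    hit = strstr(haystack, needle);
    return hit != NULL ? (long)(hit - haystack) : -1L;
}
```

The result is a byte offset rather than a character index, which is all that is needed to slice out the match afterwards.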

Also, it's just not true that Unicode text files start with a special character sequence. That might be a bad Windows habit, but it's not required by any standard. Once everyone has moved to UTF-8, we can forget about all those character set problems.

> Unlike UTF-8, we don't want different
> sizes for Western and Eastern
> characters; that makes programmers
> unhappy and software difficult to
> control. Also, UTF-8 emphasizes
> historical Western domination of
> computing science, which is not very
> friendly. No start and end sequences --
> that's it.

That's just nonsense. UTF-8 is not hard to program for. And if you don't want to do it yourself, there are a whole lot of libraries out there that deal with it. Trust me, it's going to become the standard on GNU/Linux systems in the near future.

Regarding "Western domination": Please try to view this more pragmatically. The worst that can happen with UTF-8 over UCS-4 (the 32-bit Unicode encoding) is that a full-length character needs 6 bytes rather than 4. But in practice, most characters won't need the full 6 bytes -- for instance, Japanese fits just fine into UCS-2 (16 bit), right? Why should it need more than 4 bytes per character in UTF-8 encoding?
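
The byte counts can be read off the encoding's thresholds. Here is a sketch in C, with an invented function name, that follows the original 31-bit definition of UTF-8 under discussion here (up to 6 bytes). Later revisions of the standard restrict UTF-8 to U+10FFFF and therefore to at most 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes needed to encode cp under the original 31-bit UTF-8
 * scheme: 7, 11, 16, 21, 26 and 31 payload bits for 1 to 6 bytes. */
int utf8_encoded_len(uint32_t cp)
{
    assert(cp <= 0x7FFFFFFFu);
    if (cp < 0x80u)
        return 1;
    if (cp < 0x800u)
        return 2;
    if (cp < 0x10000u)
        return 3;
    if (cp < 0x200000u)
        return 4;
    if (cp < 0x4000000u)
        return 5;
    return 6;
}
```

Everything that fits into 16 bits, which covers the common Japanese characters, takes at most 3 bytes.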

And you can't just declare ASCII obsolete. It's just fine for English, and face it: Any serious programming has to be done in English nowadays (at least Open Source programming), and I doubt this will change in the foreseeable future.

Maybe you should have a look at GTK+ and Pango. GTK+ 2.0 uses UTF-8 for all text now.



    [»] Re: It's not that bad, actually
    by Murray Cumming - Jun 15th 2002 07:00:57


    > Unlike UTF-8, we don't want different
    > sizes for Western and Eastern
    > characters; that makes programmers
    > unhappy and software difficult to
    > control. Also, UTF-8 emphasizes
    > historical Western domination of
    > computing science, which is not very
    > friendly.

    I don't think it's unfair to give greater importance to alphabetic, or even syllabic, writing systems. We want to support those other complex character-per-word character sets, but they are purely "legacy" languages. That's not cultural bias - that's system analysis. This opinion is supported by the popularity of romanized Japanese as an input system even among native speakers of Japanese.



      [»] Re: It's not that bad, actually
      by Bill Spitzak - Aug 6th 2002 03:00:06

      More importantly, even real Japanese or Chinese has so many spaces, digits, control characters, punctuation, and embedded Latin text that it will be shorter in UTF-8 than in UCS-16 or any of the other proposed encodings. There is no bias whatsoever in UTF-8; it really is a crude form of Huffman encoding. I also see no reason why a Chinese word that translates to a several-character word in English must be stored in one character of space; if anything, you are presenting a reverse bias.



    [»] Re: It's not that bad, actually
    by Nudge - Jun 15th 2002 07:08:10


    > Also, it's just not true that Unicode
    > text files start with a special
    > character sequence. That might be a bad
    > Windows habit, but it's not required by
    At least with 16-bit Unicode, little- and big-endian byte order is marked. I don't know the exact sequence; it's a 16-bit (2-byte) scheme that differs between Windows and POSIX.
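
    For reference, the marker in question is the byte order mark, U+FEFF, written in the file's own byte order. A minimal detection sketch in C follows; the enum and function names are made up for this example, and a UTF-32 mark (which begins with the same two bytes as the little-endian UTF-16 one) is deliberately not handled.

```c
#include <assert.h>
#include <stddef.h>

enum bom { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF8 };

/* Inspect the first bytes of a buffer for a byte order mark (U+FEFF).
 * FE FF means big-endian UTF-16, FF FE means little-endian UTF-16,
 * and EF BB BF is the same character encoded in UTF-8. */
enum bom detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

    A UTF-8 file has no byte order to announce, so the mark is optional there and a reader has to cope with its absence.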


    > That's just nonsense. UTF-8 is not hard
    > to program for. And if you don't want to
    > Trust me, it's going to become the
    > standard on GNU/Linux systems in the
    > near future.
    > And you can't just declare ASCII
    > obsolete. It's just fine for English,
    That's all?


    > and face it: Any serious programming has
    > to be done in English nowadays (at least
    That's not a serious argument. My article is about global thinking, and all you talk about is English in every direction.


    > Open Source programming), and I doubt
    > this will change in the foreseeable
    > future.
    That's exactly the point. If we are not willing to leave ASCII behind us, there will never be a clear encoding scheme, and my point was that, when character sizes differ from one language to another, there will also be more difficulties than with a standardized 32-bit character width. And if you have an array containing characters (a string), then every data unit is of the same size, or not? So you have to convert it internally to the same bit width anyway to work with it...


    > Maybe you should have a look at GTK+ and
    > Pango. GTK+ 2.0 uses UTF-8 for all
    > text now.
    Maybe we use UTF-8 in 2 years everywhere. And after 2 more years, we will again ask ourselves why there are different sizes for characters, and then we come up with this "historical" stuff once again and again and again... And - who wants to use a library to write simple strings to a file? We put so much energy in these sophisticated libraries but have no power to overcome the past mistakes to avoid the whole workaround.

    What about the central parsing engine idea?





      [»] Re: It's not that bad, actually
      by Daniel - Jun 15th 2002 07:31:04


      > At least with 16bit Unicode, there
      > little and big endian is marked. I don't
      > know exactly the sequence, it's a 16bit
      > (2 bytes) scheme that
      > differs between Windows and Posix.

      I'm talking about UTF-8 text files.


      > > And you can't just declare ASCII
      > > obsolete. It's just fine for English,
      >
      > That's all?

      Yes, that's all.


      > > and face it: Any serious programming has
      > > to be done in English nowadays (at least
      >
      > That's not a serious argument. My article is about
      > global thinking, and all you talk about is English
      > in every direction.

      I'm talking about source code, not the user interface. All pilots use English to communicate with each other, for good reasons. The same is true for programmers. You need a least common denominator.


      > > Open Source programming), and I doubt
      > > this will change in the foreseeable
      > > future.
      >
      > That's exactly the point. If we are not willing
      > to leave ASCII behind us, there will never be
      > a clear encoding scheme, and my point was that,
      > when character sizes differ from one language
      > to another, there will also be more difficulties than
      > to use a standardized 32bit character width.
      > And, if you have an array containing characters
      > (a string), then every data unit is of the same
      > size - or not? So you have to convert it internally
      > to same bit width anyway to work with...

      No. Quite the contrary: UTF-8 is meant for text, not single characters. Perhaps you should make yourself familiar with UTF-8. See http://www.cl.cam.ac.uk/~mgk25/unicode.html .


      > Maybe we use UTF-8 in 2 years everywhere.
      > And after 2 more years, we will again ask ourselves
      > why there are different sizes for characters, and
      > then we come up with this "historical"
      > stuff once again and again and again...

      If we go for fixed-length characters, maybe we'll ask ourselves then why we only have 32 bit...


      > And - who wants to use a library to
      > write simple strings to a file? We put so much energy
      > in these sophisticated libraries but
      > have no power to overcome the past
      > mistakes to avoid the whole workaround.

      You won't need a library to write simple strings to a file.



        [»] Re: It's not that bad, actually
        by Nudge - Jun 15th 2002 08:01:14


        > I'm talking about source code, not the
        > user interface. All pilots use English

        Why, is there a difference? Languages should
        be available for the console/terminal, imho. So
        one can program in non-ASCII. I don't see any
        difficulties besides the recoding of given sources.
        The second thing I want to point out is, that a GUI needs, for example, the next char