musl - Re: iconv Korean and Traditional Chinese research so far

Message-ID: <op.w1b4hubbdyj81a@monster.itedn32a.localdomain>
Date: Mon, 05 Aug 2013 16:28:32 +0800
From: Roy <roytam@...il.com>
To: musl@...ts.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far

Since I'm a Traditional Chinese and Japanese legacy encoding user, I think  
I can say something here.

Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker <dalias@...ifal.cx> wrote:

> OK, so here's what I've found so far. Both legacy Korean and legacy
> Traditional Chinese encodings have essentially a single base character
> set:
>

>
> Traditional Chinese:
> Big5 (CP950)
> 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
> All characters in BMP
> 27946 bytes table space
>
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
>
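
The table sizes quoted above follow directly from the grid dimensions. As a rough sketch (not musl's actual code; the function names are made up for illustration), assuming a dense table of 16-bit entries with one row per lead byte and one column per valid trail byte, the index and size could be computed like this:

```c
#include <stddef.h>

/* Trail bytes form two runs: 0x40-0x7E (63 values) and 0xA1-0xFE (94 values). */
#define TRAIL_COUNT (63 + 94)

/* Map a trail byte to a column 0..156, or -1 if it is not a valid trail byte. */
static int trail_column(unsigned char t)
{
	if (t >= 0x40 && t <= 0x7E) return t - 0x40;
	if (t >= 0xA1 && t <= 0xFE) return t - 0xA1 + 63;
	return -1;
}

/* Index into a dense table covering lead bytes lead_lo..lead_hi,
 * or -1 if the byte pair falls outside the grid. */
static long dbcs_index(unsigned char lead, unsigned char trail,
                       unsigned char lead_lo, unsigned char lead_hi)
{
	int col = trail_column(trail);
	if (lead < lead_lo || lead > lead_hi || col < 0) return -1;
	return (long)(lead - lead_lo) * TRAIL_COUNT + col;
}

/* Size in bytes of that table when every entry is 16 bits wide. */
static size_t dbcs_table_bytes(unsigned char lead_lo, unsigned char lead_hi)
{
	return (size_t)(lead_hi - lead_lo + 1) * TRAIL_COUNT * 2;
}
```

With lead bytes A1-F9 this gives 89 * 157 * 2 = 27946 bytes, and with the HKSCS range 88-FE it gives 119 * 157 * 2 = 37366 bytes, matching the figures in the quoted text. Characters outside the BMP do not fit in 16-bit entries and would need separate handling.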

There is another Big5 extension called Big5-UAO, which is used on the
world's largest telnet-based BBS, "ptt.cc".

It has two tables, one mapping Big5-UAO to Unicode and the other mapping
Unicode to Big5-UAO:
http://moztw.org/docs/big5/table/uao250-b2u.txt
http://moztw.org/docs/big5/table/uao250-u2b.txt

It extends the DBCS lead byte range down to 0x81.

> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

Big5-UAO contains Japanese and Simplified Chinese characters which
do not exist in the original MS-CP950 implementation.

>
> 2. Are there patterns to exploit? For Korean, ALL of the Hangul
> characters are actually combinations of several base letters. Unicode
> encodes them all sequentially in a pattern where the conversion to
> their constituent letters is purely algorithmic, but there seems to
> be no clean pattern in the legacy encodings, as the encodings started
> out just encoding the "important" ones then adding less important
> combinations in separate ranges.
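
For reference, the algorithmic decomposition on the Unicode side can be sketched as follows. The constants are the ones defined by the Unicode Standard for precomposed Hangul syllables (19 leading consonants, 21 vowels, 28 trailing slots including "none"); the function name is only illustrative:

```c
#include <stdint.h>

enum {
	S_BASE = 0xAC00,  /* first precomposed syllable */
	L_BASE = 0x1100,  /* first leading consonant jamo */
	V_BASE = 0x1161,  /* first vowel jamo */
	T_BASE = 0x11A7,  /* one before the first trailing consonant jamo */
	V_COUNT = 21,
	T_COUNT = 28,
	S_COUNT = 19 * 21 * 28
};

/* Decompose a precomposed Hangul syllable into its jamo.
 * Writes 2 or 3 code points to out and returns how many were written,
 * or 0 if c is not a precomposed syllable. */
static int hangul_decompose(uint32_t c, uint32_t out[3])
{
	if (c < S_BASE || c >= S_BASE + S_COUNT) return 0;
	uint32_t s = c - S_BASE;
	out[0] = L_BASE + s / (V_COUNT * T_COUNT);
	out[1] = V_BASE + (s % (V_COUNT * T_COUNT)) / T_COUNT;
	uint32_t t = s % T_COUNT;
	if (t == 0) return 2;  /* no trailing consonant */
	out[2] = T_BASE + t;
	return 3;
}
```

For example, U+D55C decomposes to U+1112, U+1161, U+11AB. This only helps on the Unicode side; as the quoted text notes, the legacy encodings do not lay the syllables out in a matching order.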

In EUC-KR (MS-CP949), there are Hanja characters (the equivalent of Kanji
in Japanese) and Japanese Katakana/Hiragana in addition to Hangul characters.

>
> Worst-case, adding Korean and Traditional Chinese tables will roughly
> double the size of iconv.o to around 150k. This will noticeably enlarge
> libc.so, but will make no difference to static-linked programs except
> those using iconv. I'm hoping we can make these additions less
> expensive, but I don't see a good way yet.

For static linking, can we have conditional linking like Qt does?
When statically linked, Qt uses Q_IMPORT_PLUGIN to pull in the CJK codec tables:

#ifndef QT_SHARED
     #include <QtPlugin>

     Q_IMPORT_PLUGIN(qcncodecs)
     Q_IMPORT_PLUGIN(qjpcodecs)
     Q_IMPORT_PLUGIN(qkrcodecs)
     Q_IMPORT_PLUGIN(qtwcodecs)
#endif


>
> At some point, especially if the cost is not reduced, I will probably
> add build-time options to exclude a configurable subset of the
> supported character encodings. This would not be extremely
> fine-grained, and the choices to exclude would probably be just:
> Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
> 8-bit might also be an option but these are so small I can't think of
> cases where it would be beneficial to omit them (5k for the tables on
> top of the 2k of actual code in iconv). Perhaps if there are cases
> where iconv is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices, dropping the
> 8-bit tables and all of the support code could be useful; the
> resulting iconv would be around 1k, I think.
>
> Rich
>
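
One way such coarse build-time options could look is sketched below. The macro names (NO_KOREAN and so on) and the lookup function are purely hypothetical and are not existing musl options; a real implementation would guard the mapping tables themselves in the same way, not just the names.

```c
#include <string.h>

/* Hypothetical switches: pass e.g. -DNO_KOREAN to the compiler
 * to drop a whole group of charsets from the build. */
static const char *const charsets[] = {
	"utf-8", "utf-16le", "utf-16be",
#ifndef NO_LEGACY8
	"iso-8859-1",
#endif
#ifndef NO_JAPANESE
	"shift_jis", "euc-jp",
#endif
#ifndef NO_SIMPLIFIED_CHINESE
	"gb2312",
#endif
#ifndef NO_TRADITIONAL_CHINESE
	"big5",
#endif
#ifndef NO_KOREAN
	"euc-kr",
#endif
};

/* Return 1 if the named charset was compiled in, 0 otherwise. */
static int charset_supported(const char *name)
{
	for (size_t i = 0; i < sizeof charsets / sizeof *charsets; i++)
		if (strcmp(charsets[i], name) == 0) return 1;
	return 0;
}
```

With no macros defined every group is present; building with -DNO_KOREAN would make charset_supported("euc-kr") return 0 while leaving the Unicode forms untouched.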

HTH,
Roy