Date: Mon, 05 Aug 2013 16:28:32 +0800
From: Roy <roytam@...il.com>
To: musl@...ts.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far

Since I'm a Traditional Chinese and Japanese legacy encoding user, I think
I can say something here.

Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker <dalias@...ifal.cx> wrote:

> OK, so here's what I've found so far. Both legacy Korean and legacy
> Traditional Chinese encodings have essentially a single base character
> set:
>
>
> Traditional Chinese:
> Big5 (CP950)
> 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
> All characters in BMP
> 27946 bytes table space
>
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
>

There is another Big5 extension called Big5-UAO, which is used by the
world's largest telnet-based BBS, "ptt.cc". It has two tables: one from
Big5-UAO to Unicode and one from Unicode to Big5-UAO.

http://moztw.org/docs/big5/table/uao250-b2u.txt
http://moztw.org/docs/big5/table/uao250-u2b.txt

It extends the DBCS lead byte range down to 0x81.

> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

Big5-UAO contains Japanese and Simplified Chinese characters that do not
exist in the original MS-CP950 implementation.

>
> 2. Are there patterns to exploit?
> For Korean, ALL of the Hangul
> characters are actually combinations of several base letters. Unicode
> encodes them all sequentially in a pattern where the conversion to
> their constituent letters is purely algorithmic, but there seems to
> be no clean pattern in the legacy encodings, as the encodings started
> out just encoding the "important" ones then adding less important
> combinations in separate ranges.

EUC-KR (MS-CP949) contains Hanja characters (i.e. Kanji in Japanese) and
Japanese Katakana/Hiragana in addition to Hangul characters.

>
> Worst-case, adding Korean and Traditional Chinese tables will roughly
> double the size of iconv.o to around 150k. This will noticeably enlarge
> libc.so, but will make no difference to static-linked programs except
> those using iconv. I'm hoping we can make these additions less
> expensive, but I don't see a good way yet.

For static linking, can we have conditional linking like Qt does?
When Qt is linked statically, it uses Q_IMPORT_PLUGIN to include the CJK
codec tables:

#ifndef QT_SHARED
#include <QtPlugin>
Q_IMPORT_PLUGIN(qcncodecs)
Q_IMPORT_PLUGIN(qjpcodecs)
Q_IMPORT_PLUGIN(qkrcodecs)
Q_IMPORT_PLUGIN(qtwcodecs)
#endif

>
> At some point, especially if the cost is not reduced, I will probably
> add build-time options to exclude a configurable subset of the
> supported character encodings. This would not be extremely
> fine-grained, and the choices to exclude would probably be just:
> Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
> 8-bit might also be an option but these are so small I can't think of
> cases where it would be beneficial to omit them (5k for the tables on
> top of the 2k of actual code in iconv). Perhaps if there are cases
> where iconv is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices, dropping the
> 8-bit tables and all of the support code could be useful; the
> resulting iconv would be around 1k, I think.
>
> Rich
>

HTH,
Roy
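
The grid sizes quoted in this thread, and the algorithmic Hangul pattern
Rich mentions, can be checked with a short C sketch. The helper names
(trail_column, dbcs_index, table_bytes, hangul_lead, hangul_vowel,
hangul_tail) are hypothetical and are not taken from musl's iconv. The
sketch assumes 2 bytes per grid cell for the 16-bit mapping table, and the
Hangul constants (base U+AC00, 19 x 21 x 28 syllables) are the ones from
the Unicode Standard's syllable arithmetic.

```c
#include <stddef.h>

/* Trail byte ranges quoted in the thread: 40-7E (63) and A1-FE (94). */
enum { TRAILS_PER_ROW = 63 + 94 };

/* Hypothetical helper: column 0..156 for a trail byte, -1 if invalid. */
static int trail_column(unsigned char t)
{
    if (t >= 0x40 && t <= 0x7E) return t - 0x40;
    if (t >= 0xA1 && t <= 0xFE) return 63 + (t - 0xA1);
    return -1;
}

/* Hypothetical helper: flat index of (lead, trail) in a DBCS grid whose
 * lead bytes span lead_min..lead_max, or -1 if outside the grid. */
static long dbcs_index(unsigned char lead, unsigned char trail,
                       unsigned char lead_min, unsigned char lead_max)
{
    int col = trail_column(trail);
    if (lead < lead_min || lead > lead_max || col < 0) return -1;
    return (long)(lead - lead_min) * TRAILS_PER_ROW + col;
}

/* Size of a full grid, assuming one 16-bit entry per cell. */
static size_t table_bytes(unsigned char lead_min, unsigned char lead_max)
{
    return (size_t)(lead_max - lead_min + 1) * TRAILS_PER_ROW * 2;
}

/* Unicode Hangul syllables: U+AC00..U+D7A3, laid out as
 * 19 leading consonants x 21 vowels x 28 trailing slots (0 = none).
 * Each function returns the component index, or -1 for a non-syllable. */
static int is_hangul_syllable(unsigned cp)
{
    return cp >= 0xAC00 && cp <= 0xD7A3;
}

static int hangul_lead(unsigned cp)
{
    return is_hangul_syllable(cp) ? (int)((cp - 0xAC00) / (21 * 28)) : -1;
}

static int hangul_vowel(unsigned cp)
{
    return is_hangul_syllable(cp) ? (int)((cp - 0xAC00) % (21 * 28) / 28) : -1;
}

static int hangul_tail(unsigned cp)
{
    return is_hangul_syllable(cp) ? (int)((cp - 0xAC00) % 28) : -1;
}
```

With the Big5 lead range A1-F9, table_bytes gives 89 x 157 x 2 = 27946, and
with the HKSCS lead range 88-FE it gives 119 x 157 x 2 = 37366, matching the
figures in the quoted message. No comparable arithmetic exists for the
legacy Korean encodings, which is why they need lookup tables.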