Message-ID: <op.w1g1tjtzdyj81a@monster.itedn32a.localdomain>
Date: Thu, 08 Aug 2013 08:18:45 +0800
From: Roy <roytam@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Status of Big5 and extensions

On Thu, 08 Aug 2013 00:50:44 +0800, Rich Felker <dalias@...ifal.cx> wrote:

> OK, so after a lot of research and discussion, I'm about to commit the
> first part of Big5 support in iconv. The plain "Big5" charset name is
> going to be the maximal set everybody agrees on, which, as far as I
> can tell, is just CP950. (Actually even IBM's Big5 variant in ICU
> differs from CP950 in a few places, but it's just wrong. The one
> ideograph where it differs conflicts with Unihan.txt, which is
> authoritative on which Unicode characters encode which character
> identities from historical CJK charsets.)
>
> As for extensions, my understanding of HKSCS is to a point now where I
> feel we can add it (charset name "Big5-HKSCS"), based on the 2008
> government publication which has a few new characters beyond the old
> versions in most software. (Thank you nsz for helping dig up all the
> files and researching how they differ!) However there are a few
> technical difficulties to implementation: the Unicode codepoints span
> a 17-bit range rather than just a 16-bit range, so we need an
> efficient way of doing the mappings. What's worse, several HKSCS
> codepoints map to Latin characters with multiple combining marks which
> have no precomposed representation. Supporting these will require
> extending iconv to be able to output two or more characters of output
> for each unit of input, which is mildly error-prone with the current
> design, so I may hold off on HKSCS support until I overhaul some of
> the core logic of the iconv function.
>
> Now, the hard part: Taiwan extensions. I appreciate all the help from
> Roy, but I'm still not to the point of having anything nearly ready to
> be added. The UAO extension set has at least 290 mappings to
> codepoints in the Unicode Private Use Areas (PUA), which makes it
> unsuitable for inclusion as-is. This may be a resolvable issue if
> these 290 characters all exist in Unicode and could be remapped, but I
> do not feel any of us are qualified to determine this. So UAO is
> probably still a long way away from being able to be adopted into
> iconv, unless we have authoritative data on the identities of these
> characters in the form of something from an official standards body or
> at the very least multiple major vendors (enough to ensure that any
> future standardization would be consistent and non-controversial).
>

I asked one of the creators of UAO at
http://forum.moztw.org/viewtopic.php?f=10&t=40174
and got a reply describing those PUA characters by their Big5 code ranges:

0xFA40 - 0xFA63: Reserved for user-defined characters
0xC8A5 - 0xC8B0: For Big5-2003 compliance
Others: "Art characters" of the ChinaSea character set (csw 1.0, i.e.
cswsmin.tte), which are not mapped to Unicode and sit at code points
not occupied by HKSCS.

> If there are other supersets of CP950 (possibly subsets of UAO) which
> would be useful for supporting Taiwanese users, I would very much like
> to understand the situation on them and whether it's feasible to
> include them in iconv at this point. Perhaps the most important thing
> to know at this point is what practical difficulties exist for
> Taiwanese users limited to CP950, and which extension charsets could
> solve this. I feel like issues like "I cannot write my own name" are
> on a completely different level than "I can't mix Japanese text in
> with my legacy-encoded data [when I really should be using Unicode for
> this anyway]". The former sort of issue is something that demands some
> sort of support, even if imperfect or mildly hackish, whereas the
> latter is not a justification for hacks and bypassing proper standards
> processes.
>
> Rich
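
As a rough illustration of the grouping in the forum reply above, the following Python sketch sorts a UAO Big5 code into one of the three groups and checks whether a Unicode code point lies in a Private Use Area. The function names and group labels are hypothetical and come from neither UAO nor musl. The classifier also assumes its input is already known to be one of the roughly 290 codes that UAO maps into the PUA, since the "Others" group is simply whatever falls outside the two listed ranges.

```python
# Unicode Private Use Areas, as defined by the Unicode Standard.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def is_private_use(codepoint):
    """Return True if the Unicode code point lies in a Private Use Area."""
    return any(lo <= codepoint <= hi for lo, hi in PUA_RANGES)


# Big5 code ranges taken from the forum reply; the labels are hypothetical.
UAO_PUA_GROUPS = (
    (0xFA40, 0xFA63, "user-defined"),
    (0xC8A5, 0xC8B0, "big5-2003"),
)


def classify_uao_pua(big5_code):
    """Classify a two-byte Big5 code (as an integer, e.g. 0xFA40).

    Assumption: big5_code is already known to map to the PUA in UAO.
    Anything outside the two listed ranges falls into the catch-all
    "Others" group (ChinaSea art characters).
    """
    for lo, hi, label in UAO_PUA_GROUPS:
        if lo <= big5_code <= hi:
            return label
    return "chinasea-art"
```

For example, `classify_uao_pua(0xFA40)` falls in the user-defined group, while `is_private_use(0x4E00)` is false because U+4E00 is an ordinary CJK ideograph. Each listed range stays within a single lead byte, so a plain integer comparison is enough here.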