Message-ID: <20130807165044.GA14867@brightrain.aerifal.cx>
Date: Wed, 7 Aug 2013 12:50:44 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Status of Big5 and extensions

OK, so after a lot of research and discussion, I'm about to commit the first part of Big5 support in iconv. The plain "Big5" charset name is going to be the maximal set everybody agrees on, which, as far as I can tell, is just CP950. (Actually even IBM's Big5 variant in ICU differs from CP950 in a few places, but it's just wrong. The one ideograph where it differs conflicts with Unihan.txt, which is authoritative on which Unicode characters encode which character identities from historical CJK charsets.)

As for extensions, my understanding of HKSCS is to a point now where I feel we can add it (charset name "Big5-HKSCS"), based on the 2008 government publication which has a few new characters beyond the old versions in most software. (Thank you nsz for helping dig up all the files and researching how they differ!) However there are a few technical difficulties to implementation: the Unicode codepoints span a 17-bit range rather than just a 16-bit range, so we need an efficient way of doing the mappings. What's worse, several HKSCS codepoints map to Latin characters with multiple combining marks which have no precomposed representation. Supporting these will require extending iconv to be able to output two or more characters of output for each unit of input, which is mildly error-prone with the current design, so I may hold off on HKSCS support until I overhaul some of the core logic of the iconv function.

Now, the hard part: Taiwan extensions. I appreciate all the help from Roy, but I'm still not to the point of having anything nearly ready to be added. The UAO extension set has at least 290 mappings to codepoints in the Unicode Private Use Areas (PUA), which makes it unsuitable for inclusion as-is.
This may be a resolvable issue if these 290 characters all exist in Unicode and could be remapped, but I do not feel any of us are qualified to determine this. So UAO is probably still a long way away from being able to be adopted into iconv, unless we have authoritative data on the identities of these characters in the form of something from an official standards body or at the very least multiple major vendors (enough to ensure that any future standardization would be consistent and non-controversial).

If there are other supersets of CP950 (possibly subsets of UAO) which would be useful for supporting Taiwanese users, I would very much like to understand the situation on them and whether it's feasible to include them in iconv at this point.

Perhaps the most important thing to know at this point is what practical difficulties exist for Taiwanese users limited to CP950, and which extension charsets could solve this. I feel like issues like "I cannot write my own name" are on a completely different level than "I can't mix Japanese text in with my legacy-encoded data [when I really should be using Unicode for this anyway]". The former sort of issue is something that demands some sort of support, even if imperfect or mildly hackish, whereas the latter is not a justification for hacks and bypassing proper standards processes.

Rich