Date: Thu, 8 Aug 2013 00:30:35 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Re: Status of Big5 and extensions

On Thu, Aug 08, 2013 at 05:53:21AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@...ifal.cx> [2013-08-07 22:11:19 -0400]:
> > Since you mentioned Big5-2003, I've been looking into it, and it seems
> > like it should be part of our base Big5 mapping. Diffing moztw's
> > version of it against CP950.TXT (after cleaning up both), I get:
>
> i checked an other source for big5-2003 and it is bug compatible
> with the moztw one (so it might not be mozilla's fault)
> http://www.csie.ntu.edu.tw/~r92030/project/big5/
>
> this source maps C255 to 5F5E instead of 5F5D
> (also observed in the icu version of cp950)

Unfortunately this mismatches the normative Unihan.txt which says
U+5F5D corresponds to the historical Big5 character C255, so we need
at least some justification for the change if Unihan.txt is buggy.

> > These are all part of ETEN omitted from CP950, and should definitely
> > be in Big5 base.
> >
> > +0xC6A1 0x2460
> > +0xC6A2 0x2461
> > +0xC6A3 0x2462
> > +0xC6A4 0x2463
> > +...
> > +0xC7F1 0x30F5
> > +0xC7F2 0x30F6
> >
> > These are also from ETEN. Notably, the Cyrillic block that immediately
> > follows these is still omitted in Big5-2003, for reasons that appear
> > political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> > to add the Cyrillic block back in here.
> >
>
> the C6BF-C6D9 part is incompatible in hkscs and big5-2003
> hkscs == uao != big5-2003 for these codes
> icu agrees with the old hkscs pua codes so this might be
> just a bug in the big5-2003 source

I believe I've dug up the story on this here:

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035389.html

In short, the Big5-2003 mappings from moztw.org are wrong. The "KANGXI
RADICAL" characters in Unicode are compatibility characters. This
means they have compatibility-equivalents which should be used in
Unicode documents in place of them, much like the Greek letter μ
should be used in place of Latin-1 MICRO SIGN (µ), and only exist for
round-trip compatibility with legacy character sets which encode the
character twice. Since Big5 (unlike CNS 11643) does not encode the
Kangxi radicals twice, using them in a mapping to Unicode is wrong use
of Unicode, regardless of what the mapping table from the standards
body says. Thus, I have no problem with going with the UAO/HKSCS way.

According to the above link, however, HKSCS has introduced a problem.
They've double mapped U+5E7A and are thus mapping the one in the C6CD
slot to U+2F33 instead, since the FBF4 slot is mapping to U+5E7A. I'm
not sure what the right solution to this is; since we're not
interested in round-trip, it might make the most sense to just ignore
it and map them both to the (same) proper character.

> > -0xF9FA 0x256D
> > -0xF9FB 0x256E
> > -0xF9FC 0x2570
> > -0xF9FD 0x256F
> > +0xF9FA 0x2554
> > +0xF9FB 0x2557
> > +0xF9FC 0x255A
> > +0xF9FD 0x255D
> >
> > This looks like pure Mozilla cruft. Is there any justification for
> > these sorts of changes?
> >
>
> these are box drawing chars (like A2A4-A2A7 above),
> the diff is double vs light lines
>
> cp950 == hkscs == uao != big-2003 (and missing from icu)
>
> hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)

I don't really care so much about these anyway since they do not
affect linguistic content, just warez .nfo files. ;-)

> > Does the above analysis look correct? If so I will go ahead and merge
> > the above changes to Big5 support into musl.
> >
> > BTW, the only non-PUA part of UAO within the standard Big5 range
> > (89x157 grid) that won't be mapped with these changes is the stuff
> > right after the Cyrillic block. This part does not conflict with
> > current HKSCS, so if I had good sources from both the Taiwan and HK
> > sides supporting the position that these mappings will not conflict
> > with other extensions in current use or with future expansion of
> > HKSCS, we could consider including that part of UAO in the base Big5
> > mapping. At this point this is only an idea for consideration, but we
> > can keep it in mind.
>
> note that
> C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
> (old hkscs pua codes agree with uao)

OK, so is this non-conflicting?

Rich
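
The compatibility-character argument in this message can be checked against the Unicode character database. What follows is only a minimal sketch in Python, not musl code: the helper names `is_compat` and `fold_kangxi` are hypothetical, and restricting the fold to the Kangxi Radicals range (U+2F00 to U+2FD5) is an assumption made so that other compatibility characters in the table, such as the circled digits at C6A1, are left alone. It illustrates the "map them both to the (same) proper character" option for the C6CD and FBF4 slots.

```python
import unicodedata

# Kangxi Radicals block, assumed here to be the only range worth folding.
KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5


def is_compat(ch):
    """True if ch has a compatibility (not canonical) decomposition."""
    # Compatibility decompositions are tagged, e.g. "<compat> 5E7A".
    return unicodedata.decomposition(ch).startswith("<")


def fold_kangxi(table):
    """Replace Kangxi radical targets with their unified ideographs.

    `table` maps Big5 codes to Unicode code points (both ints).
    Entries outside the Kangxi Radicals range are returned unchanged.
    """
    folded = {}
    for big5, ucs in table.items():
        if KANGXI_FIRST <= ucs <= KANGXI_LAST:
            nfkc = unicodedata.normalize("NFKC", chr(ucs))
            # Only fold when normalization yields a single code point.
            if len(nfkc) == 1:
                ucs = ord(nfkc)
        folded[big5] = ucs
    return folded


# The two slots discussed above, as HKSCS maps them per the linked thread.
hkscs_slots = {0xC6CD: 0x2F33, 0xFBF4: 0x5E7A}

if __name__ == "__main__":
    for big5, ucs in sorted(fold_kangxi(hkscs_slots).items()):
        print("0x%04X 0x%04X" % (big5, ucs))
```

After folding, both slots decode to the same character, so the table no longer round-trips for these two codes; that is acceptable only under the stated premise that round-trip is not a goal.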