Message-ID: <20130808035321.GN25714@port70.net>
Date: Thu, 8 Aug 2013 05:53:21 +0200
From: Szabolcs Nagy <nsz@...t70.net>
To: musl@...ts.openwall.com
Subject: Re: Re: Status of Big5 and extensions

* Rich Felker <dalias@...ifal.cx> [2013-08-07 22:11:19 -0400]:
> Since you mentioned Big5-2003, I've been looking into it, and it seems
> like it should be part of our base Big5 mapping. Diffing moztw's
> version of it against CP950.TXT (after cleaning up both), I get:

i checked another source for big5-2003 and it is bug compatible
with the moztw one (so it might not be mozilla's fault)
http://www.csie.ntu.edu.tw/~r92030/project/big5/

this source maps C255 to 5F5E instead of 5F5D
(also observed in the icu version of cp950)

> -0xA156 0x2013
> +0xA156 0x2015
> ...
> -0xA1C2 0x00AF
> +0xA1C2 0x203E
> ...
> -0xA2A4 0x2550
> -0xA2A5 0x255E
> -0xA2A6 0x256A
> -0xA2A7 0x2561
> +0xA2A4 0x2501
> +0xA2A5 0x251D
> +0xA2A6 0x253F
> +0xA2A7 0x2525
>
> The above all looks like pure nonstandard Mozilla behavior. What's
> next is more interesting:
>
> -0xA2CC 0x5341
> -0xA2CD 0x5344
> -0xA2CE 0x5345
> +0xA2CC 0x3038
> +0xA2CD 0x3039
> +0xA2CE 0x303A
>
> This looks to me like an actual bug in CP950.TXT from Unicode.
> Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
> for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
> which is in itself almost certainly a bug.
>
> +0xA3C0 0x2400
> +0xA3C1 0x2401
> +...
> +0xA3DF 0x241F
> +0xA3E0 0x2421
>
> These are all part of ETEN omitted from CP950, and should definitely
> be in Big5 base.
>
> +0xC6A1 0x2460
> +0xC6A2 0x2461
> +0xC6A3 0x2462
> +0xC6A4 0x2463
> +...
> +0xC7F1 0x30F5
> +0xC7F2 0x30F6
>
> These are also from ETEN. Notably, the Cyrillic block that immediately
> follows these is still omitted in Big5-2003, for reasons that appear
> political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> to add the Cyrillic block back in here.
>

the C6BF-C6D9 part is incompatible in hkscs and big5-2003
hkscs == uao != big5-2003 for these codes
icu agrees with the old hkscs pua codes
so this might be just a bug in the big5-2003 source

> Finally:
>
> -0xF9FA 0x256D
> -0xF9FB 0x256E
> -0xF9FC 0x2570
> -0xF9FD 0x256F
> +0xF9FA 0x2554
> +0xF9FB 0x2557
> +0xF9FC 0x255A
> +0xF9FD 0x255D
>
> This looks like pure Mozilla cruft. Is there any justification for
> these sorts of changes?
>

these are box drawing chars (like A2A4-A2A7 above),
the diff is double vs light lines
cp950 == hkscs == uao != big5-2003 (and missing from icu)

hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)

> Does the above analysis look correct? If so I will go ahead and merge
> the above changes to Big5 support into musl.
>
> BTW, the only non-PUA part of UAO within the standard Big5 range
> (89x157 grid) that won't be mapped with these changes is the stuff
> right after the Cyrillic block. This part does not conflict with
> current HKSCS, so if I had good sources from both the Taiwan and HK
> sides supporting the position that these mappings will not conflict
> with other extensions in current use or with future expansion of
> HKSCS, we could consider including that part of UAO in the base Big5
> mapping. At this point this is only an idea for consideration, but we
> can keep it in mind.

note that C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
(old hkscs pua codes agree with uao)
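
The comparison discussed in this thread (diffing two cleaned-up mapping tables and checking whether a table maps two Big5 codes to the same Unicode code point) can be sketched roughly as below. This is only an illustration, not the script anyone in the thread actually used: the function names are hypothetical, the input format is assumed to be the two-column `0xBIG5 0xUNICODE` layout shown in the quoted diff, and the sample tables are tiny excerpts built from the entries quoted above (with A451 assumed to be unchanged between the two), not the full CP950.TXT or Big5-2003 data.

```python
from collections import defaultdict

def parse_table(text):
    """Parse '0xA156 0x2013' style lines into {big5_code: unicode_cp}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        fields = line.split()
        if len(fields) < 2:
            continue  # empty line or entry with no mapping
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table

def diff_tables(old, new):
    """Return (removed, added, changed) going from old to new."""
    removed = {k: old[k] for k in old.keys() - new.keys()}
    added = {k: new[k] for k in new.keys() - old.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return removed, added, changed

def duplicate_targets(table):
    """Find Unicode code points reached from more than one Big5 code."""
    by_cp = defaultdict(list)
    for code, cp in table.items():
        by_cp[cp].append(code)
    return {cp: sorted(codes) for cp, codes in by_cp.items() if len(codes) > 1}

# Tiny excerpts assembled from the thread, for illustration only.
CP950_SAMPLE = """
0xA156 0x2013
0xA1C2 0x00AF
0xA2CC 0x5341
0xA451 0x5341
"""
BIG5_2003_SAMPLE = """
0xA156 0x2015
0xA1C2 0x203E
0xA2CC 0x3038
0xA3C0 0x2400
0xA451 0x5341
"""

if __name__ == "__main__":
    cp950 = parse_table(CP950_SAMPLE)
    big5_2003 = parse_table(BIG5_2003_SAMPLE)
    removed, added, changed = diff_tables(cp950, big5_2003)
    for code, (a, b) in sorted(changed.items()):
        print(f"-0x{code:04X} 0x{a:04X}\n+0x{code:04X} 0x{b:04X}")
    for code, cp in sorted(added.items()):
        print(f"+0x{code:04X} 0x{cp:04X}")
    for cp, codes in duplicate_targets(cp950).items():
        print(f"U+{cp:04X} mapped from", ", ".join(f"{c:04X}" for c in codes))
```

On the sample data the duplicate check flags U+5341 as reachable from both A2CC and A451 in the CP950 excerpt, which is the non-one-to-one mapping the quoted analysis calls a probable bug.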