Date: Thu, 8 Aug 2013 05:53:21 +0200
From: Szabolcs Nagy <>
Subject: Re: Re: Status of Big5 and extensions

* Rich Felker <> [2013-08-07 22:11:19 -0400]:
> Since you mentioned Big5-2003, I've been looking into it, and it seems
> like it should be part of our base Big5 mapping. Diffing moztw's
> version of it against CP950.TXT (after cleaning up both), I get:

i checked another source for big5-2003 and it is bug-compatible
with the moztw one (so it might not be mozilla's fault)

this source maps C255 to 5F5E instead of 5F5D
(also observed in the icu version of cp950)
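
for anyone who wants to repeat this kind of comparison, here is a
minimal sketch of diffing two cleaned-up mapping tables. it assumes
one "0xBIG5 0xUNICODE" pair per line with optional '#' comments;
parse_table, diff_tables and the two inline samples are made-up
names for illustration, and the sample pairs are taken from the
diff quoted in this message.

```python
def parse_table(text):
    """Parse '0xA156 0x2013' style lines into {big5_code: unicode_code}."""
    table = {}
    for line in text.splitlines():
        # Drop anything after '#', then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment-only line, or undefined entry
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def diff_tables(old, new):
    """Return '-'/'+' lines for every Big5 code whose mapping differs."""
    out = []
    for code in sorted(old.keys() | new.keys()):
        a, b = old.get(code), new.get(code)
        if a == b:
            continue
        if a is not None:
            out.append("-0x%04X 0x%04X" % (code, a))
        if b is not None:
            out.append("+0x%04X 0x%04X" % (code, b))
    return out


# Tiny illustrative samples (not the full tables).
CP950_SAMPLE = """\
0xA156 0x2013
0xA1C2 0x00AF
"""
BIG5_2003_SAMPLE = """\
0xA156 0x2015
0xA1C2 0x203E
0xA3C0 0x2400
"""

if __name__ == "__main__":
    old = parse_table(CP950_SAMPLE)
    new = parse_table(BIG5_2003_SAMPLE)
    print("\n".join(diff_tables(old, new)))
```

unlike a textual diff, this pairs the '-' and '+' line for each
code, so the ordering differs from the hunks quoted below.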

> -0xA156 0x2013
> +0xA156 0x2015
> ...
> -0xA1C2 0x00AF
> +0xA1C2 0x203E
> ...
> -0xA2A4 0x2550
> -0xA2A5 0x255E
> -0xA2A6 0x256A
> -0xA2A7 0x2561
> +0xA2A4 0x2501
> +0xA2A5 0x251D
> +0xA2A6 0x253F
> +0xA2A7 0x2525
> The above all looks like pure nonstandard Mozilla behavior. What's
> next is more interesting:
> -0xA2CC 0x5341
> -0xA2CD 0x5344
> -0xA2CE 0x5345
> +0xA2CC 0x3038
> +0xA2CD 0x3039
> +0xA2CE 0x303A
> This looks to me like an actual bug in CP950.TXT from Unicode.
> Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
> for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
> which is in itself almost certainly a bug.
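
the non-one-to-one property can be checked mechanically. a minimal
sketch, assuming the table has already been parsed into a dict of
big5 code -> unicode code point; find_many_to_one and SAMPLE are
made-up names, and the A451 entry assumes CP950.TXT carries the
mapping that Unihan.txt gives, as the quoted text implies.

```python
from collections import defaultdict


def find_many_to_one(table):
    """Return {unicode_code: [big5 codes]} for code points hit more than once."""
    by_unicode = defaultdict(list)
    for big5, uni in table.items():
        by_unicode[uni].append(big5)
    return {uni: sorted(codes)
            for uni, codes in by_unicode.items() if len(codes) > 1}


# Illustrative sample only: A2CC as in the '-' lines above, plus the
# A451 entry (assumption: present in the same table).
SAMPLE = {
    0xA2CC: 0x5341,
    0xA451: 0x5341,
    0xA2CD: 0x5344,
    0xA2CE: 0x5345,
}

if __name__ == "__main__":
    for uni, codes in sorted(find_many_to_one(SAMPLE).items()):
        print("U+%04X <- %s" % (uni, ", ".join("%04X" % c for c in codes)))
```
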
> +0xA3C0 0x2400
> +0xA3C1 0x2401
> +...
> +0xA3DF 0x241F
> +0xA3E0 0x2421
> These are all part of ETEN omitted from CP950, and should definitely
> be in Big5 base.
> +0xC6A1 0x2460
> +0xC6A2 0x2461
> +0xC6A3 0x2462
> +0xC6A4 0x2463
> +...
> +0xC7F1 0x30F5
> +0xC7F2 0x30F6
> These are also from ETEN. Notably, the Cyrillic block that immediately
> follows these is still omitted in Big5-2003, for reasons that appear
> political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> to add the Cyrillic block back in here.

the C6BF-C6D9 part is incompatible in hkscs and big5-2003
hkscs == uao != big5-2003 for these codes
icu agrees with the old hkscs pua codes so this might be
just a bug in the big5-2003 source

> Finally:
> -0xF9FA 0x256D
> -0xF9FB 0x256E
> -0xF9FC 0x2570
> -0xF9FD 0x256F
> +0xF9FA 0x2554
> +0xF9FB 0x2557
> +0xF9FC 0x255A
> +0xF9FD 0x255D
> This looks like pure Mozilla cruft. Is there any justification for
> these sorts of changes?

these are box drawing chars (like A2A4-A2A7 above),
the diff is light arc corners (256D-2570) vs
double-line corners (2554, 2557, 255A, 255D)

cp950 == hkscs == uao != big5-2003 (and missing from icu)

hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)
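
the character names make the difference in the box drawing codes
above easy to see. a small sketch using python's unicodedata
module; PAIRS and describe are made-up names, and the code points
are the ones from the F9FA-F9FD diff quoted above.

```python
import unicodedata

# (big5 code, code point on the '-' side, code point on the '+' side)
PAIRS = [
    (0xF9FA, 0x256D, 0x2554),
    (0xF9FB, 0x256E, 0x2557),
    (0xF9FC, 0x2570, 0x255A),
    (0xF9FD, 0x256F, 0x255D),
]


def describe(cp):
    """Format a code point together with its Unicode character name."""
    return "U+%04X %s" % (cp, unicodedata.name(chr(cp)))


if __name__ == "__main__":
    for big5, minus, plus in PAIRS:
        print("%04X" % big5)
        print("  - " + describe(minus))
        print("  + " + describe(plus))
```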

> Does the above analysis look correct? If so I will go ahead and merge
> the above changes to Big5 support into musl.
> BTW, the only non-PUA part of UAO within the standard Big5 range
> (89x157 grid) that won't be mapped with these changes is the stuff
> right after the Cyrillic block. This part does not conflict with
> current HKSCS, so if I had good sources from both the Taiwan and HK
> sides supporting the position that these mappings will not conflict
> with other extensions in current use or with future expansion of
> HKSCS, we could consider including that part of UAO in the base Big5
> mapping. At this point this is only an idea for consideration, but we
> can keep it in mind.

note that
C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
(old hkscs pua codes agree with uao)
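
for reference, a sketch of how a code lands in the 89x157 grid
mentioned above. it assumes the usual big5 byte ranges (lead byte
A1-F9, trail byte 40-7E or A1-FE), which are not spelled out in
this thread; big5_grid_index is a made-up name.

```python
def big5_grid_index(code):
    """Map a two-byte Big5 code to (row, column) in the 89x157 grid.

    Assumed ranges: lead byte 0xA1..0xF9 (89 rows); trail byte
    0x40..0x7E (63 columns) followed by 0xA1..0xFE (94 columns).
    Returns None for codes outside the grid.
    """
    lead, trail = code >> 8, code & 0xFF
    if not 0xA1 <= lead <= 0xF9:
        return None
    if 0x40 <= trail <= 0x7E:
        col = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        col = trail - 0xA1 + 63
    else:
        return None
    return (lead - 0xA1, col)


if __name__ == "__main__":
    for code in (0xC87A, 0xC87C, 0xC8A4):
        print("%04X -> %s" % (code, big5_grid_index(code)))
```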
