Message-ID: <op.w1hbi0o8dyj81a@monster.itedn32a.localdomain>
Date: Thu, 08 Aug 2013 11:48:26 +0800
From: Roy <roytam@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Re: Status of Big5 and extensions

On Thu, 08 Aug 2013 10:11:19 +0800, Rich Felker <dalias@...ifal.cx> wrote:

> On Thu, Aug 08, 2013 at 08:18:45AM +0800, Roy wrote:
>> >Now, the hard part: Taiwan extensions. I appreciate all the help from
>> >Roy, but I'm still not to the point of having anything nearly ready to
>> >be added. The UAO extension set has at least 290 mappings to
>> >codepoints in the Unicode Private Use Areas (PUA), which makes it
>> >unsuitable for inclusion as-is. This may be a resolvable issue if
>> >these 290 characters all exist in Unicode and could be remapped, but I
>> >do not feel any of us are qualified to determine this. So UAO is
>> >probably still a long way away from being able to be adopted into
>> >iconv, unless we have authoritative data on the identities of these
>> >characters in the form of something from an official standards body or
>> >at the very least multiple major vendors (enough to ensure that any
>> >future standardization would be consistent and non-controversial).
>> >
>>
>> I asked one of the creators of UAO at
>> http://forum.moztw.org/viewtopic.php?f=10&t=40174
>> and got a reply describing those PUA characters by Big5 range:
>> 0xFA40 - 0xFA63: Reserved for user-defined characters
>
> So these can simply be dropped from the mapping.
>
>> 0xC8A5 - 0xC8B0: for Big5-2003 compliant
>
> I do not see any C8xx mappings in Big5-2003, so this explanation does
> not seem plausible.

Neither do I, but this range appears in the draft, marked as reserved:
http://web.archive.org/web/20041210015709/http://pingyeh.net/big5/big5-2003-v3/summary.txt
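
For what it's worth, separating PUA targets from the rest of a mapping table is mechanical. Below is a rough Python sketch, not anything from musl. The function names are made up for illustration, and the 0xFA40 entry is a hypothetical placeholder, since the actual UAO target code points are not listed in this thread.

```python
# Unicode Private Use Area ranges: the BMP PUA plus planes 15 and 16.
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

def is_pua(codepoint):
    """True if the code point lies in any Private Use Area."""
    return any(lo <= codepoint <= hi for lo, hi in PUA_RANGES)

def split_pua(mappings):
    """Split {big5_code: unicode_codepoint} into (kept, dropped_as_pua)."""
    kept = {b: u for b, u in mappings.items() if not is_pua(u)}
    dropped = {b: u for b, u in mappings.items() if is_pua(u)}
    return kept, dropped

# 0xA156 -> U+2015 is taken from the diff in this thread.
# 0xFA40 -> U+E000 is a HYPOTHETICAL placeholder for a user-defined slot.
sample = {0xA156: 0x2015, 0xFA40: 0xE000}
kept, dropped = split_pua(sample)
```

This only identifies which entries point into the PUA; deciding whether a dropped character has a real Unicode identity still needs authoritative data.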

>
> Since you mentioned Big5-2003, I've been looking into it, and it seems
> like it should be part of our base Big5 mapping. Diffing moztw's
> version of it against CP950.TXT (after cleaning up both), I get:
>
> -0xA156 0x2013
> +0xA156 0x2015
> ...
> -0xA1C2 0x00AF
> +0xA1C2 0x203E
> ...
> -0xA2A4 0x2550
> -0xA2A5 0x255E
> -0xA2A6 0x256A
> -0xA2A7 0x2561
> +0xA2A4 0x2501
> +0xA2A5 0x251D
> +0xA2A6 0x253F
> +0xA2A7 0x2525
>
> The above all looks like pure nonstandard Mozilla behavior. What's
> next is more interesting:
>
> -0xA2CC 0x5341
> -0xA2CD 0x5344
> -0xA2CE 0x5345
> +0xA2CC 0x3038
> +0xA2CD 0x3039
> +0xA2CE 0x303A
>
> This looks to me like an actual bug in CP950.TXT from Unicode.
> Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
> for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
> which is in itself almost certainly a bug.

The remapping of 0xA2CC-0xA2CE from Unihan ideographs to symbol code
points was approved at the Big5-2003 meeting (vote #1 in the minutes):
http://web.archive.org/web/20050307160806/http://pingyeh.net/big5/big5-2003-v3/1106-2003-note.txt
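
The non-one-to-one mapping is easy to check for mechanically. Here is a small Python sketch, with a made-up function name, that groups Big5 codes by their Unicode target and reports targets reached from more than one code. The sample uses only the U+5341 pair and the 0xA156 entry described above, not the full CP950.TXT.

```python
from collections import defaultdict

def find_duplicate_targets(mappings):
    """Return {unicode_codepoint: [big5 codes]} for every code point
    that is the target of more than one Big5 code."""
    by_target = defaultdict(list)
    for big5, ucs in mappings:
        by_target[ucs].append(big5)
    return {u: sorted(codes) for u, codes in by_target.items() if len(codes) > 1}

# Entries as described in this thread: both A2CC and A451 land on U+5341.
cp950_sample = [
    (0xA2CC, 0x5341),
    (0xA451, 0x5341),
    (0xA156, 0x2013),
]

duplicates = find_duplicate_targets(cp950_sample)
for ucs, codes in duplicates.items():
    print("U+%04X <- %s" % (ucs, ", ".join("0x%04X" % c for c in codes)))
```

Any code point listed here cannot round-trip, because the reverse conversion has to pick one of the Big5 codes.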

>
> +0xA3C0 0x2400
> +0xA3C1 0x2401
> +...
> +0xA3DF 0x241F
> +0xA3E0 0x2421
>
> These are all part of ETEN omitted from CP950, and should definitely
> be in Big5 base.

Those are control symbols; they are listed in summary.txt.

>
> +0xC6A1 0x2460
> +0xC6A2 0x2461
> +0xC6A3 0x2462
> +0xC6A4 0x2463
> +...
> +0xC7F1 0x30F5
> +0xC7F2 0x30F6
>
> These are also from ETEN. Notably, the Cyrillic block that immediately
> follows these is still omitted in Big5-2003, for reasons that appear
> political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> to add the Cyrillic block back in here.

The government said the Cyrillic block is not in CNS 11643, so it was
removed. But YES, we should have it.

>
> Finally:
>
> -0xF9FA 0x256D
> -0xF9FB 0x256E
> -0xF9FC 0x2570
> -0xF9FD 0x256F
> +0xF9FA 0x2554
> +0xF9FB 0x2557
> +0xF9FC 0x255A
> +0xF9FD 0x255D
>
> This looks like pure Mozilla cruft. Is there any justification for
> these sorts of changes?

That was another vote at the meeting (vote #2 in the minutes).

Big5-2003 draft v3 (should be identical to released version) for reference:
http://web.archive.org/web/20051017043312/http://pingyeh.net/big5/big5-2003-v3/big5-2003-draft-v3.csv

BTW, MS voted at the meeting but did not adopt Big5-2003 in its newer
OSes. WTF?

>
> Does the above analysis look correct? If so I will go ahead and merge
> the above changes to Big5 support into musl.

Mostly correct IMHO.
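
In case anyone wants to reproduce the comparison, here is a minimal Python sketch of diffing two cleaned-up tables in the "0xBIG5 0xUCS" form used above. The function names are mine, and the sample strings contain only a few entries quoted in this thread, not the real files. Unlike the diff output above, it emits the "-" and "+" lines per code rather than grouped by hunk.

```python
def parse_table(text):
    """Parse lines like '0xA156 0x2013' into {big5_code: unicode_codepoint}.
    Anything after '#' is treated as a comment; blank lines are skipped."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table

def diff_tables(old, new):
    """Return diff-style lines for codes whose mapping differs."""
    lines = []
    for code in sorted(old.keys() | new.keys()):
        if old.get(code) == new.get(code):
            continue
        if code in old:
            lines.append("-0x%04X 0x%04X" % (code, old[code]))
        if code in new:
            lines.append("+0x%04X 0x%04X" % (code, new[code]))
    return lines

# Small excerpts taken from the diff quoted in this thread.
cp950 = parse_table("0xA156 0x2013\n0xA1C2 0x00AF\n")
big5_2003 = parse_table("0xA156 0x2015\n0xA1C2 0x203E\n0xA3C0 0x2400\n")

result = diff_tables(cp950, big5_2003)
print("\n".join(result))
```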

>
> BTW, the only non-PUA part of UAO within the standard Big5 range
> (89x157 grid) that won't be mapped with these changes is the stuff
> right after the Cyrillic block. This part does not conflict with
> current HKSCS, so if I had good sources from both the Taiwan and HK
> sides supporting the position that these mappings will not conflict
> with other extensions in current use or with future expansion of
> HKSCS, we could consider including that part of UAO in the base Big5
> mapping. At this point this is only an idea for consideration, but we
> can keep it in mind.
>
>> Others: "Art characters" from the ChinaSea character set (csw 1.0, i.e.
>> cswsmin.tte), which are not mapped to Unicode and sit at code points
>> not occupied by HKSCS.
>
> So basically dingbats?

Not only dingbats, AFAIK.
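
On the 89x157 grid mentioned above: assuming it means lead bytes 0xA1-0xF9 (89 values) and trail bytes 0x40-0x7E plus 0xA1-0xFE (63 + 94 = 157 values), a code can be turned into a flat table index as in the Python sketch below. This is only an illustration with a made-up function name, not how musl's iconv actually lays out its tables.

```python
def big5_grid_index(code):
    """Map a two-byte Big5 code to an index in the 89x157 grid,
    or return None if it falls outside the standard range."""
    lead, trail = code >> 8, code & 0xFF
    if not 0xA1 <= lead <= 0xF9:
        return None
    if 0x40 <= trail <= 0x7E:
        col = trail - 0x40           # first 63 columns
    elif 0xA1 <= trail <= 0xFE:
        col = trail - 0xA1 + 63      # remaining 94 columns
    else:
        return None
    return (lead - 0xA1) * 157 + col

print(big5_grid_index(0xA140))  # first cell of the grid
print(big5_grid_index(0xF9FE))  # last cell of the grid
print(big5_grid_index(0xFA40))  # user-defined area, outside the grid
```

Codes such as 0xFA40-0xFA63 fall outside this grid, which is why they can be treated separately from the base mapping.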

>
> Rich
