Date: Thu, 08 Aug 2013 08:18:45 +0800
From: Roy <roytam@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Status of Big5 and extensions

On Thu, 08 Aug 2013 00:50:44 +0800, Rich Felker <dalias@...ifal.cx> wrote:

> OK, so after a lot of research and discussion, I'm about to commit the
> first part of Big5 support in iconv. The plain "Big5" charset name is
> going to be the maximal set everybody agrees on, which, as far as I
> can tell, is just CP950. (Actually even IBM's Big5 variant in ICU
> differs from CP950 in a few places, but it's just wrong. The one
> ideograph where it differs conflicts with Unihan.txt, which is
> authoritative on which Unicode characters encode which character
> identities from historical CJK charsets.)
>
> As for extensions, my understanding of HKSCS is to a point now where I
> feel we can add it (charset name "Big5-HKSCS"), based on the 2008
> government publication which has a few new characters beyond the old
> versions in most software. (Thank you nsz for helping dig up all the
> files and researching how they differ!) However there are a few
> technical difficulties to implementation: the Unicode codepoints span
> a 17-bit range rather than just a 16-bit range, so we need an
> efficient way of doing the mappings. What's worse, several HKSCS
> codepoints map to Latin characters with multiple combining marks which
> have no precomposed representation. Supporting these will require
> extending iconv to be able to output two or more characters of output
> for each unit of input, which is mildly error-prone with the current
> design, so I may hold off on HKSCS support until I overhaul some of
> the core logic of the iconv function.
>
> Now, the hard part: Taiwan extensions. I appreciate all the help from
> Roy, but I'm still not to the point of having anything nearly ready to
> be added. The UAO extension set has at least 290 mappings to
> codepoints in the Unicode Private Use Areas (PUA), which makes it
> unsuitable for inclusion as-is. This may be a resolvable issue if
> these 290 characters all exist in Unicode and could be remapped, but I
> do not feel any of us are qualified to determine this. So UAO is
> probably still a long way away from being able to be adopted into
> iconv, unless we have authoritative data on the identities of these
> characters in the form of something from an official standards body or
> at the very least multiple major vendors (enough to ensure that any
> future standardization would be consistent and non-controversial).
>

I asked one of the creators of UAO about this at
http://forum.moztw.org/viewtopic.php?f=10&t=40174
and got a reply describing those PUA characters by their Big5 code ranges:
0xFA40 - 0xFA63: Reserved for user-defined characters
0xC8A5 - 0xC8B0: For Big5-2003 compliance
Others: "Art characters" from the ChinaSea character set (csw 1.0, i.e.
cswsmin.tte), which have no Unicode mapping and occupy codepoints not
used by HKSCS.
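
To make the three groups above concrete, here is a minimal sketch in C of
how a UAO code that is already known to map into the PUA could be sorted
into them. The enum and function names are hypothetical (they are not part
of musl or of UAO), and the sketch assumes the two ranges quoted above are
inclusive. It does not check whether a given code is actually one of the
PUA-mapped ones.

```c
/* Hypothetical categories for a Big5/UAO code that UAO maps to the PUA. */
enum uao_pua_kind {
    UAO_PUA_USER_DEFINED,  /* 0xFA40-0xFA63: reserved for user-defined chars */
    UAO_PUA_BIG5_2003,     /* 0xC8A5-0xC8B0: present for Big5-2003 compliance */
    UAO_PUA_CHINASEA_ART   /* anything else: ChinaSea "art characters" */
};

/* Classify a two-byte code (lead byte in the high 8 bits).
 * Assumption: the caller has already established that this code is one
 * of the UAO entries mapped into the Unicode Private Use Areas. */
static enum uao_pua_kind uao_pua_classify(unsigned code)
{
    if (code >= 0xFA40 && code <= 0xFA63)
        return UAO_PUA_USER_DEFINED;
    if (code >= 0xC8A5 && code <= 0xC8B0)
        return UAO_PUA_BIG5_2003;
    return UAO_PUA_CHINASEA_ART;
}
```

Only the last group would need real identification work if these
characters were ever to be remapped to non-PUA codepoints.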

> If there are other supersets of CP950 (possibly subsets of UAO) which
> would be useful for supporting Taiwanese users, I would very much like
> to understand the situation on them and whether it's feasible to
> include them in iconv at this point. Perhaps the most important thing
> to know at this point is what practical difficulties exist for
> Taiwanese users limited to CP950, and which extension charsets could
> solve this. I feel like issues like "I cannot write my own name" are
> on a completely different level than "I can't mix Japanese text in
> with my legacy-encoded data [when I really should be using Unicode for
> this anyway]". The former sort of issue is something that demands some
> sort of support, even if imperfect or mildly hackish, whereas the
> latter is not a justification for hacks and bypassing proper standards
> processes.
>
> Rich
