musl - Status of Big5 and extensions

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130807165044.GA14867@brightrain.aerifal.cx>
Date: Wed, 7 Aug 2013 12:50:44 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Status of Big5 and extensions

OK, so after a lot of research and discussion, I'm about to commit the
first part of Big5 support in iconv. The plain "Big5" charset name is
going to be the maximal set everybody agrees on, which, as far as I
can tell, is just CP950. (Actually even IBM's Big5 variant in ICU
differs from CP950 in a few places, but it's just wrong. The one
ideograph where it differs conflicts with Unihan.txt, which is
authoritative on which Unicode characters encode which character
identities from historical CJK charsets.)

As for extensions, my understanding of HKSCS is to a point now where I
feel we can add it (charset name "Big5-HKSCS"), based on the 2008
government publication which has a few new characters beyond the old
versions in most software. (Thank you nsz for helping dig up all the
files and researching how they differ!) However there are a few
technical difficulties to implementation: the Unicode codepoints span
a 17-bit range rather than just a 16-bit range, so we need an
efficient way of doing the mappings. What's worse, several HKSCS
codepoints map to Latin characters with multiple combining marks which
have no precomposed representation. Supporting these will require
extending iconv to be able to output two or more characters of output
for each unit of input, which is mildly error-prone with the current
design, so I may hold off on HKSCS support until I overhaul some of
the core logic of the iconv function.

Now, the hard part: Taiwan extensions. I appreciate all the help from
Roy, but I'm still not to the point of having anything nearly ready to
be added. The UAO extension set has at least 290 mappings to
codepoints in the Unicode Private Use Areas (PUA), which makes it
unsuitable for inclusion as-is. This may be a resolvable issue if
these 290 characters all exist in Unicode and could be remapped, but I
do not feel any of us are qualified to determine this. So UAO is
probably still a long way away from being able to be adopted into
iconv, unless we have authoritative data on the identities of these
characters in the form of something from an official standards body or
at the very least multiple major vendors (enough to ensure that any
future standardization would be consistent and non-controversial).

If there are other supersets of CP950 (possibly subsets of UAO) which
would be useful for supporting Taiwanese users, I would very much like
to understand the situation on them and whether it's feasible to
include them in iconv at this point. Perhaps the most important thing
to know at this point is what practical difficulties exist for
Taiwanese users limited to CP950, and which extension charsets could
solve this. I feel like issues like "I cannot write my own name" are
on a completely different level than "I can't mix Japanese text in
with my legacy-encoded data [when I really should be using Unicode for
this anyway]". The former sort of issue is something that demands some
sort of support, even if imperfect or mildly hackish, whereas the
latter is not a justification for hacks and bypassing proper standards
processes.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.