Date: Mon, 5 Aug 2013 01:00:29 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far

On Sun, Aug 04, 2013 at 12:51:52PM -0400, Rich Felker wrote:
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
> 
> Korean:
> CP949
> Lead byte range is extended to 81-FD (125)
> Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
> 44500 bytes table space
> 
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
> 
> The big remaining questions are:
> 
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

For what it's worth, there is no IANA charset registration for any
supplement to Korean. See the table here:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

The only entries for Korean are ISO-2022-KR and EUC-KR.

Big5-HKSCS however is registered. This matches my intuition that, of
the two, HKSCS would be more important to real-world usage than Korean
extensions.

If we were to omit CP949 and just go with KS X 1001, but include
HKSCS, the total table size (not counting the small amount of code
needed) would be 17484+37366 = 54850.

With both supported, it would be 44500+37366 = 81866.

With just KS X 1001 and base Big5, it would be 17484+27946 = 45430.
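
As a sanity check on these figures, here is a rough sketch that
recomputes the table sizes from the byte ranges. It assumes each table
is a dense lead-by-tail grid of 16-bit entries. The CP949 ranges and the
HKSCS lead range come from the quoted message; the Big5 tail range and
the KS X 1001 and base Big5 lead ranges are assumptions, chosen because
they reproduce the sizes quoted above. All function and variable names
are illustrative only.

```python
# Sketch only: assumes a dense lead x tail grid with 2 bytes per entry.
ENTRY_BYTES = 2


def span(lo, hi):
    """Number of byte values in the inclusive range lo..hi."""
    return hi - lo + 1


def table_bytes(lead_count, tail_count):
    """Size in bytes of a dense 16-bit mapping table."""
    return lead_count * tail_count * ENTRY_BYTES


# CP949 ranges, as given in the quoted message.
cp949_lead = span(0x81, 0xFD)                                        # 125
cp949_tail = span(0x41, 0x5A) + span(0x61, 0x7A) + span(0x81, 0xFE)  # 178

# Assumed: KS X 1001 as a 93 x 94 grid (lead A1-FD, tail A1-FE).
ksx1001_lead = span(0xA1, 0xFD)
ksx1001_tail = span(0xA1, 0xFE)

# Assumed: Big5 tail bytes 40-7E and A1-FE, shared by base Big5 and HKSCS.
big5_tail = span(0x40, 0x7E) + span(0xA1, 0xFE)                      # 157
big5_lead = span(0xA1, 0xF9)    # assumed base Big5 lead range
hkscs_lead = span(0x88, 0xFE)   # 119, from the quoted message

sizes = {
    "CP949": table_bytes(cp949_lead, cp949_tail),
    "KS X 1001": table_bytes(ksx1001_lead, ksx1001_tail),
    "Big5": table_bytes(big5_lead, big5_tail),
    "HKSCS": table_bytes(hkscs_lead, big5_tail),
}

# The three combinations discussed in this message.
totals = {
    "KS X 1001 + HKSCS": sizes["KS X 1001"] + sizes["HKSCS"],
    "CP949 + HKSCS": sizes["CP949"] + sizes["HKSCS"],
    "KS X 1001 + Big5": sizes["KS X 1001"] + sizes["Big5"],
}

for name, size in {**sizes, **totals}.items():
    print(f"{name}: {size} bytes")
```

This counts only the 16-bit tables; the extra mapping needed for the
HKSCS characters outside the BMP is not included.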

Given that HKSCS is a standard, registered MIME charset, that the cost
over base Big5 is only about 10k, and that it seems necessary for
real-world usage in Hong Kong, I think it's pretty obvious that we
should support it. So I think the question we're left with is whether
the CP949 (MS encoding) extension for Korean is important to support.
The cost is roughly 27k (44500-17484 = 27016).

I'm going to keep doing research to see if identifying the characters
added in it sheds any light on whether there are important additions.
Obviously I would like to be able to exclude it but I don't want this
decision to be made unfairly based on my bias when it comes to bloat.
:)

Rich
