Date: Mon, 26 Aug 2013 21:53:49 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Re: Re: Big5 "mostly" complete

On Sun, Aug 18, 2013 at 07:19:57PM +0800, Roy wrote:
> On Sun, 18 Aug 2013 15:32:29 +0800, Rich Felker <dalias@...ifal.cx> wrote:
> 
> >On Sun, Aug 18, 2013 at 12:20:47PM +0800, Roy wrote:
> >>Both Big5-UAO and Big5-HKSCS are needed by people in Taiwan and
> >>Hong Kong.
> >>Big5-UAO contains some commonly used dingbats (for example, the "♡"
> >>mark) and numeric symbols (for example, "①") that are not in CP950.
> >>Big5-UAO is still used not only on the ptt.cc telnet BBS but also
> >>in text data files (file lists, cue sheets), because some
> >>applications do not support UTF-8 (for example, Perl file-system
> >>I/O on Windows, and CD rippers).
> >>Big5-HKSCS is used for storing commonly used Cantonese
> >>ideographs (for example, "𨋢", which means "lift" in Cantonese) in
> >>Hong Kong.
> >
> >HKSCS is supported as of yesterday's commit. I'm aware that it's
> >needed for representing Cantonese language in Big5, and that it's
> >widely used on the web.
> >
> >What I'm not clear on is the necessity of UAO. Keep in mind that iconv
> >is an API for information interchange: things like interpreting web
> >content, email, old text files, etc. The fact that UAO exists is not
> >alone reason to support it; it has to actually have usefulness in
> >situations where the iconv interface should be used. If you want to
> >see it included, this is what you need to convince us of:
> >
> >- That it's in widespread use in large volumes of existing data (on
> >  the web, text files, etc.) or data that is being newly generated
> >  (e.g. as a default encoding of popular mail software).
> 
> People are told *NOT* to publish files in Big5-UAO on the web (even
> the creator of UAO appeals to people not to do so), but some such
> files still exist in archived form (as I said before, for example,
> cue-sheet files for CD-ROM images).
> For local data processing, though, UAO does make file management
> easier for Windows users.

Based on this, I think:

(1) It's reasonable to omit UAO for now, and
(2) Support for iconv to load user-defined character mappings would be
a worthwhile feature to work on post-1.0.

My reasoning is that the goal of iconv in musl, at least for the
built-in character set conversions, is to facilitate information
interchange, particularly reading of data that may be received in
email, as documents published on the web, via IRC or IM protocols,
etc. An encoding whose creators specifically request that it NOT be
used for publishing/interchange is well outside this scope.

I agree with your examples (CD-ROM cue sheets, archived text files,
that telnet BBS, etc.) that there is a need by some users to
process/import data encoded in UAO, but most of these usages do not
seem to require general applications, treating charsets in an
abstract, MIME-style manner, to be able to handle it. For many of the
examples, a command-line conversion utility (BTW, there are ones much
more powerful than iconv out there) would be the logical choice. For
the BBS, my understanding is that most of its users are using special
telnet/terminal apps with the conversion built-in.

> >- That it's necessary to represent linguistic content in languages
> >  used in Taiwan, not just as a substitute for Unicode to represent
> >  foreign languages.
> 
> It is. Some Chinese ideographs that are used in personal names, such
> as "喆" and "堃", are not in the CP950 mapping.

How do these users send email or enter their names in web-based apps?
My guess would be that the email clients switch to UTF-8 when
encountering a character they can't encode in Big5, and that,
nowadays, most web apps are built on CMS that are Unicode-based. Is
this correct?
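
To make the guessed fallback concrete, here is a minimal sketch in C. It
is not taken from any actual mail client; the helper name
pick_mail_charset is hypothetical. It tries to encode a UTF-8 string as
Big5 with iconv and falls back to UTF-8 if the conversion fails or is
lossy. It assumes the local iconv knows the name "BIG5" as a destination
charset; where it does not, iconv_open fails and the sketch simply
reports UTF-8.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: return "BIG5" if every character of the UTF-8
 * string s can be encoded in Big5 by the local iconv, else "UTF-8". */
const char *pick_mail_charset(const char *s)
{
    iconv_t cd = iconv_open("BIG5", "UTF-8");
    if (cd == (iconv_t)-1)
        return "UTF-8";     /* this iconv cannot encode to Big5 at all */

    char *in = (char *)s;   /* iconv does not modify the input bytes */
    size_t inleft = strlen(s);
    int lossless = 1;

    while (inleft > 0) {
        char scratch[256];  /* converted bytes are discarded */
        char *out = scratch;
        size_t outleft = sizeof scratch;
        size_t r = iconv(cd, &in, &inleft, &out, &outleft);

        if (r == (size_t)-1 && errno == E2BIG)
            continue;       /* scratch buffer full; keep converting */
        if (r != 0) {
            /* Either an error (e.g. EILSEQ) or a nonzero count of
             * non-reversible conversions: not safely encodable. */
            lossless = 0;
            break;
        }
    }

    iconv_close(cd);
    return lossless ? "BIG5" : "UTF-8";
}
```

The check for a nonzero return value matters because an implementation
may report an unencodable character either with an EILSEQ error or by
substituting a replacement and counting it as a non-reversible
conversion.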

> >- That failure to support it would put musl's iconv in a worse
> >  position of compatibility than other iconv implementations or
> >  software-specific (e.g. in-browser) character set conversions.
> 
> People have made unofficial Big5-UAO patches for libiconv and
> glibc (gconv) to meet their needs, so an optional Big5-UAO mapping
> in musl libc would be an advantage for people in Taiwan.

*nod*

For what it's worth, how do those patches handle it? Do they add a new
"Big5-UAO" charset name to iconv, or do they modify the existing Big5
to treat it as UAO?
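
On any given system, the first half of that question can be checked by
probing iconv_open. The sketch below assumes the patched name would be
spelled "BIG5-UAO", which is a guess and not confirmed here; the helper
name charset_supported is likewise hypothetical.

```c
#include <iconv.h>

/* Hypothetical helper: return 1 if the local iconv accepts `name` as a
 * source charset for conversion to UTF-8, else 0. */
int charset_supported(const char *name)
{
    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return 0;           /* name not recognized (errno is EINVAL) */
    iconv_close(cd);
    return 1;
}
```

Calling charset_supported("BIG5-UAO") shows whether a separate name is
registered. It cannot tell whether an existing "BIG5" has been
redefined as UAO; that would need a test conversion of a byte sequence
that only UAO assigns.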

My feeling for now is to increase the priority of adding custom local
charmap files to iconv after musl 1.0 is released. My main reason is
that "intended for information interchange" vs "intended only for
local use" seems to be the best guideline for whether an encoding is
appropriate to include built-in.

Rich
