Date: Wed, 28 Aug 2013 08:57:24 +0800
From: Roy <roytam@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Re: Re: Big5 "mostly" complete

On Tue, 27 Aug 2013 09:53:49 +0800, Rich Felker <dalias@...ifal.cx> wrote:

> On Sun, Aug 18, 2013 at 07:19:57PM +0800, Roy wrote:
>> On Sun, 18 Aug 2013 15:32:29 +0800, Rich Felker <dalias@...ifal.cx>  
>> wrote:
>>
>> >On Sun, Aug 18, 2013 at 12:20:47PM +0800, Roy wrote:
>> >>Both Big5-UAO and Big5-HKSCS are needed for people in Taiwan and
>> >>Hong Kong.
>> >>For Big5-UAO, some commonly used dingbats (for example the "♡"
>> >>mark) and numeric representations (for example "①") are in Big5-UAO
>> >>but not in CP950.
>> >>Big5-UAO is still being used not only on the ptt.cc telnet BBS, but
>> >>also in text data files (file lists/cue sheets), because some
>> >>applications do not support UTF-8 (for example, Perl file-system
>> >>I/O on Windows, CD rippers).
>> >>Big5-HKSCS is used for storing commonly used Cantonese
>> >>ideographs (for example, "𨋢" means "lift" in Cantonese) in Hong
>> >>Kong.
>> >
>> >HKSCS is supported as of yesterday's commit. I'm aware that it's
>> >needed for representing Cantonese language in Big5, and that it's
>> >widely used on the web.
>> >
>> >What I'm not clear on is the necessity of UAO. Keep in mind that iconv
>> >is an API for information interchange: things like interpreting web
>> >content, email, old text files, etc. The fact that UAO exists is not
>> >alone reason to support it; it has to actually have usefulness in
>> >situations where the iconv interface should be used. If you want to
>> >see it included, this is what you need to convince us of:
>> >
>> >- That it's in widespread use in large volumes of existing data (on
>> >  the web, text files, etc.) or data that is being newly generated
>> >  (e.g. as a default encoding of popular mail software).
>>
>> People are told *NOT* to publish files in Big5-UAO on the web (even
>> the creator of UAO appeals to people not to do so), but some such
>> files still exist in archived form (as I said before, for example
>> cue-sheet files of CD-ROM images, etc.).
>> But for local data processing, UAO does make file management easier
>> for Windows users.
>
> Based on this, I think:
>
> (1) It's reasonable to omit UAO for now, and
> (2) Support for iconv to load user-defined character mappings would be
> a worthwhile feature to work on post-1.0.
>

That is good. But I have a few feature requests about this:
- A user-defined mapping can be overlaid on another encoding, just like HKSCS
does.
- A user-defined mapping can be embedded in a statically linked binary.
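
To illustrate the first request, here is a minimal Python sketch of an overlay
lookup: the user-defined table is consulted first, and anything it does not
cover falls through to the base encoding. This is only an illustration, not
how musl's iconv works or would work. The function name, the table, and the
byte pair 0x87 0x40 are hypothetical (the pair is not a real UAO code point).

```python
def decode_with_overlay(data: bytes, overlay: dict, base: str = "big5") -> str:
    """Decode double-byte text, checking a user-defined overlay before the base codec."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Single-byte ASCII range is passed through unchanged.
            out.append(chr(data[i]))
            i += 1
            continue
        pair = data[i:i + 2]
        if len(pair) < 2:
            raise ValueError("truncated double-byte sequence")
        if pair in overlay:
            # The overlay entry wins over whatever the base encoding defines.
            out.append(overlay[pair])
        else:
            # Not in the overlay: fall back to the base encoding.
            out.append(pair.decode(base))
        i += 2
    return "".join(out)


# Hypothetical overlay: the byte pair is made up for illustration only.
HYPOTHETICAL_OVERLAY = {b"\x87\x40": "\u5586"}  # U+5586 is "喆"

text = decode_with_overlay(b"A\xa4\xa4\x87\x40", HYPOTHETICAL_OVERLAY)
```

The second request would amount to compiling such a table into the binary
rather than loading it from a file at run time.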

Also, conversion from Unicode to CJK legacy encodings is a must (I hope it is
available before musl-1.0).

> My reasoning is that the goal of iconv in musl, at least for the
> built-in character set conversions, is to facilitate information
> interchange, particularly reading of data that may be received in
> email, as documents published on the web, via IRC or IM protocols,
> etc. An encoding whose creators specifically request that it NOT be
> used for publishing/interchange is well outside this scope.

Yes, it is not encouraged for publishing since it is not a standard, and
people are not encouraged to install UAO blindly, but people do use it for
private interchange (like sending files via FTP/instant messaging).

>
> I agree with your examples (CD-ROM cue sheets, archived text files,
> that telnet BBS, etc.) that there is a need by some users to
> process/import data encoded in UAO, but most of these usages do not
> seem to require general applications, treating charsets in an
> abstract, MIME-style manner, to be able to handle it. For many of the
> examples, a command-line conversion utility (BTW, there are ones much
> more powerful than iconv out there) would be the logical choice. For
> the BBS, my understanding is that most of its users are using special
> telnet/terminal apps with the conversion built-in.
>
>> >- That it's necessary to represent linguistic content in languages
>> >  used in Taiwan, not just as a substitute for Unicode to represent
>> >  foreign languages.
>>
>> It does, some Chinese ideographs are used as part of name, but not
>> in CP950 mapping like "喆" and "堃".
>
> How do these users send email or enter their names in web-based apps?
> My guess would be that the email clients switch to UTF-8 when
> encountering a character they can't encode in Big5, and that,
> nowadays, most web apps are built on CMS that are Unicode-based. Is
> this correct?
>

Yes, most popular web apps use UTF-8 nowadays.
In the past, people entered (方方土) for 堃 and (吉吉) for 喆, and they might
also install ChinaSea/UAO/etc. charset extensions for 堃 and 喆.
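
The fallback guessed at above (try Big5, switch to UTF-8 when a character
cannot be encoded) can be sketched in Python as follows. This is an assumption
about how such a client might behave, not a description of any particular mail
program; the function name choose_charset is hypothetical.

```python
def choose_charset(text: str, legacy: str = "big5"):
    """Return (charset, encoded bytes), preferring the legacy charset when it can represent the text."""
    try:
        return legacy, text.encode(legacy)
    except UnicodeEncodeError:
        # At least one character has no mapping in the legacy charset.
        return "utf-8", text.encode("utf-8")


charset_plain, body_plain = choose_charset("\u4e2d\u6587")      # "中文"
charset_mixed, body_mixed = choose_charset("\u4e2d\U0001F600")  # "中" plus an emoji
```

The first call stays in Big5, while the second has to switch to UTF-8 because
the emoji has no Big5 mapping.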

>> >- That failure to support it would put musl's iconv in a worse
>> >  position of compatibility than other iconv implementations or
>> >  software-specific (e.g. in-browser) character set conversions.
>>
>> Since people made unofficial Big5-UAO patches for libiconv and
>> glibc (gconv) to meet their needs, an optional Big5-UAO mapping in
>> musl libc would be an advantage to people in Taiwan.
>
> *nod*
>
> For what it's worth, how do those patches handle it? Do they add a new
> "Big5-UAO" charset name to iconv, or do they modify the existing Big5
> to treat it as UAO?

The original patches by Tiberius Teng modify Big5 with Big5-UAO mappings.
I'm trying to reach Tiberius and get the patch if available.

There is also another libiconv patch that adds a separate big5-uao encoding instead:
http://ku.myftp.org/goods/libiconv-1.11-uao.patch.bz2

>
> My feeling for now is to increase the priority of adding custom local
> charmap files to iconv after musl 1.0 is released. My main reason is
> that "intended for information interchange" vs "intended only for
> local use" seems to be the best guideline for whether an encoding is
> appropriate to include built-in.
>
> Rich
