Date: Tue, 06 Aug 2013 23:11:23 +0800
From: Roy <roytam@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Re: Re: iconv Korean and Traditional Chinese research so far

On Tue, 06 Aug 2013 21:32:05 +0800, Rich Felker <dalias@...ifal.cx> wrote:

> On Tue, Aug 06, 2013 at 02:14:33PM +0800, Roy wrote:
>> >My impression (please correct me if I'm wrong) is that you can't use
>> >Big5-UAO as the system encoding on modern versions of Windows (just
>> >ancient ones where you install unmaintained third-party software that
>> >hacks the system charset tables)
>>
>> It doesn't "hack" the nls file; it replaces it with a UAO-available
>> CP950 nls file.
>> The executable (setup program) is generated with NSIS (Nullsoft
>> Scriptable Install System).
>> Since the nls file format hasn't changed from NT 3.1 in 1993 up to
>> the current NT 6.3 (i.e. Win 8.1 "Blue"), the UAO-available CP950 nls
>> file will continue to work in newer versions of Windows unless MS
>> replaces the nls file format with something different.
>
> OK, thanks for clarifying. I'd still consider it a ways into the
> "hack" domain if the OS vendor still is not supporting it directly,
> but it does make a difference that it still works "cleanly". I was
> under the impression that these sorts of things change between
> Windows versions in ways that would preclude using old, unmaintained
> patches like this. I agree that just the fact that certain OS vendors
> do not support an encoding is not in itself a reason not to support
> it.
>
>> >and that it's not supported in GNU
>> >libiconv. If this is the case, and especially if Big5-UAO's main use
>> >is on a telnet-based BBS where everybody is using special telnet
>> >clients that have their own Big5-UAO converters,
>>
>> GNU libiconv doesn't even support IBM EBCDIC (both SBCS and stateful
>> SBCS+DBCS)!
>>
>> So does it matter whether GNU libiconv supports a given encoding?
>> (Yes, glibc iconv (or rather, its gconv modules) does support both
>> IBM EBCDIC SBCS and stateful SBCS+DBCS encodings.)
>
> I was under the impression that GNU libiconv was in sync with glibc's
> iconv, but I have not checked this. I actually was more interested in
> glibc's, which is in widespread use. glibc's inclusion or exclusion of
> a feature is not in itself a reason to include or exclude it, but
> supporting something that glibc supports does have the added
> motivation that it will increase compatibility with what programs are
> expecting.
>
>> >I'd find it really
>> >hard to justify trying to support this. But I'm open to hearing
>> >arguments on why we should, if you believe it's important.
>>
>> I think it would be nice to have a build/link-time option for those
>> "unpopular" encodings.
>>
>> >>For static linking, can we have conditional linking like QT does?
>> >
>> >My feeling is that it's a tradeoff, and probably has more pros than
>> >cons. Unlike QT, musl's iconv is extremely small.
>>
>> I would add "right now" here. When we add more encodings later, the
>> iconv module will be bigger than it is now, and people will need a
>> way to conditionally compile the encodings they need (for both
>> dynamic and static linking).
>
> It's never been my intent to add more encodings later (aside from pure
> non-table-based variants of existing ones, like the ISO-2022 versions)
> once coverage is complete, at least not as built-in features. This can
> be discussed if you think there are reasons it needs to change, but up
> until now, the plan has been to support:
>
> - ISO-8859 based 8-bit encodings
> - Other 8-bit encodings with actual legacy usage (mainly Cyrillic)
> - JIS 0208 based encodings
> - KS X 1001 based encodings
> - GB 2312 and supersets
> - Big5 and supersets
>
> All of those except Big5 and supersets are now supported, so short of
> any change, my position is that right now we're discussing the "last"
> significant addition to musl's iconv.
>
> Some things that are definitely outside the scope of musl's iconv:
>
> - Anything whose characters are not present in Unicode
> - Anything PUA-based (really, same as above)
> - Newly invented encodings with no historical encoded data
>
> What's more borderline is where UAO falls: encodings that have neither
> governmental nor language-body-authority support, nor any vendor support
> from other software vendors, but for which there is at least one major
> corpus of historical data and/or current usage for the encoding by
> users of the language(s) whose characters are encoded.
>
> However, based on the file at
>
> http://moztw.org/docs/big5/table/uao250-b2u.txt
>
> a number of the mappings UAO defines are into the private use area.
> This would generally preclude support (as this is a font-specific
> encoding, not a Unicode encoding) unless the affected characters have
> since been added to Unicode and could be remapped to the correct
> codepoints. Do you know the status on this?

Those are in the Big5-2003 compatibility code range. Big5-2003 is in an
appendix of CNS 11643, but it is rarely used since no OS or application
supports it.
So skipping the PUA mappings is fine.
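
As a rough illustration of what skipping the PUA mappings could look like
when processing a table such as uao250-b2u.txt, here is a minimal C sketch.
The line format (pairs written as "0xBBBB 0xUUUU") and the names is_pua and
parse_b2u_line are assumptions made for illustration; they are not taken
from musl or from the UAO tools.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Return 1 if cp lies in one of the three Unicode private use ranges. */
static int is_pua(uint32_t cp)
{
	return (cp >= 0xE000 && cp <= 0xF8FF)        /* BMP private use area */
	    || (cp >= 0xF0000 && cp <= 0xFFFFD)      /* plane 15 */
	    || (cp >= 0x100000 && cp <= 0x10FFFD);   /* plane 16 */
}

/* Parse one table line, assumed to look like "0xBBBB 0xUUUU"
 * (hypothetical format: Big5-style code, then Unicode codepoint).
 * Returns 1 and fills *big5 and *ucs for a usable mapping;
 * returns 0 for comments, malformed lines, and PUA targets. */
static int parse_b2u_line(const char *line, unsigned *big5, uint32_t *ucs)
{
	unsigned b, u;

	assert(line && big5 && ucs);
	if (sscanf(line, "0x%x 0x%x", &b, &u) != 2)
		return 0;               /* not a mapping line */
	if (is_pua(u))
		return 0;               /* skip private-use targets */
	*big5 = b;
	*ucs = u;
	return 1;
}
```

A table generator would call parse_b2u_line on each input line and emit
only the entries for which it returns 1.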

>
> I'm also still unclear on whether this is a superset of HKSCS (it's
> definitely not directly, but maybe it is if the PUA mappings are
> corrected; I did not do any detailed checks but just noted the lack of
> mappings to the non-BMP codepoints HKSCS uses).

No, it isn't. There are some code conflicts between HKSCS (2001/2004) and UAO.
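
For anyone who wants to check such conflicts mechanically, the sketch below
counts the codes that two tables both define but map to different Unicode
codepoints. The struct mapping layout and the function name count_conflicts
are hypothetical, and the sketch assumes both tables are already sorted by
code with no duplicate entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mapping {
	uint16_t code;  /* two-byte Big5-style code */
	uint32_t ucs;   /* Unicode codepoint it maps to */
};

/* Count codes present in both tables whose targets differ.
 * Assumption: a and b are sorted by ascending code, no duplicates. */
static size_t count_conflicts(const struct mapping *a, size_t na,
                              const struct mapping *b, size_t nb)
{
	size_t i = 0, j = 0, n = 0;

	assert(a != NULL && b != NULL);
	while (i < na && j < nb) {
		if (a[i].code < b[j].code) {
			i++;            /* code only in a */
		} else if (a[i].code > b[j].code) {
			j++;            /* code only in b */
		} else {
			if (a[i].ucs != b[j].ucs)
				n++;    /* same code, different character */
			i++;
			j++;
		}
	}
	return n;
}
```

A nonzero result means neither table can be treated as a plain superset of
the other without deciding which mapping wins for the conflicting codes.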

>
>> >Even with all the
>> >above, the size of iconv.o will be under 130k, maybe closer to 110k.
>> >If you actually use iconv in your program, this is a small price to
>> >pay for having it fully functional. On the other hand, if linking it
>> >is conditional, you have to consider who makes the decision, and when.
>> >If it's at link time for each application, that's probably too much of
>> >a musl-specific version.
>>
>> Since statically linking libc's iconv is a new area (other libcs
>> don't touch this topic much), I think we can create a standard for
>> statically linking specified encoding tables at link time.
>> (This is also one reason why libc should provide a unique
>> identifier via a preprocessor define.)
>
> I don't see how "creating a standard" for doing this would make the
> situation any better. Most software authors these days are at best
> tolerant of the existence of static linking, and more often hostile to
> it. They're not going to add specific build behavior for static
> linking, and even if they do, they're likely to get it wrong, in which
> case the user ends up stuck with binaries that can't process input in
> their language.
>
> Rich
