Date: Tue, 6 Aug 2013 09:32:05 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Re: Re: iconv Korean and Traditional Chinese research so
 far

On Tue, Aug 06, 2013 at 02:14:33PM +0800, Roy wrote:
> >My impression (please correct me if I'm wrong) is that you can't use
> >Big5-UAO as the system encoding on modern versions of Windows (just
> >ancient ones where you install unmaintained third-party software that
> >hacks the system charset tables)
> 
> It doesn't "hack" the nls file but replaces with UAO-available CP950
> nls file.
> The executable(setup program) is generated with NSIS(Nullsoft
> Scriptable Install System).
> Since the nls file format doesn't change since NT 3.1 in 1993 till
> now NT 6.2(i.e. Win 8.1 "Blue"), the UAO-available CP950 nls will
> continue to work in newer versions of windows unless MS throw away
> nls file format with something different.

OK, thanks for clarifying. I'd still consider it a ways into the
"hack" domain if the OS vendor still is not supporting it directly,
but it does make a difference that it still works "cleanly". I was
under the impression that these sorts of things change between
Windows versions in ways that would preclude using old, unmaintained
patches like this. I agree that just the fact that certain OS vendors
do not support an encoding is not in itself a reason not to support
it.

> >and that it's not supported in GNU
> >libiconv. If this is the case, and especially if Big5-UAO's main use
> >is on a telnet-based BBS where everybody is using special telnet
> >clients that have their own Big5-UAO converters,
> 
> GNU libiconv even not supports IBM EBCDIC(both SBCS and stateful
> SBCS+DBCS)!
> 
> So does it matter if GNU libiconv is not support whatever encodings?
> (Yes glibc iconv(or say, gconv modules) does support both IBM EBCDIC
> SBCS and stateful SBCS+DBCS encodings)

I was under the impression that GNU libiconv was in sync with glibc's
iconv, but I have not checked this. I actually was more interested in
glibc's, which is in widespread use. glibc's inclusion or exclusion of
a feature is not in itself a reason to include or exclude it, but
supporting something that glibc supports does have the added
motivation that it will increase compatibility with what programs are
expecting.

> >I'd find it really
> >hard to justify trying to support this. But I'm open to hearing
> >arguments on why we should, if you believe it's important.
> 
> I think it will be nice to have build/link time option for those
> "unpopular" encodings.
> 
> >>For static linking, can we have conditional linking like QT does?
> >
> >My feeling is that it's a tradeoff, and probably has more pros than
> >cons. Unlike QT, musl's iconv is extremely small.
> 
> I would add "right now" here. When we adds more encoding later,
> iconv module will be bigger than now, and people will need to find a
> way to conditionally compiling the encoding they need (for both
> dynamically or statically)

It's never been my intent to add more encodings later (aside from pure
non-table-based variants of existing ones, like the ISO-2022 versions)
once coverage is complete, at least not as built-in features. This can
be discussed if you think there are reasons it needs to change, but up
until now, the plan has been to support:

- ISO-8859 based 8-bit encodings
- Other 8-bit encodings with actual legacy usage (mainly Cyrillic)
- JIS 0208 based encodings
- KS X 1001 based encodings
- GB 2312 and supersets
- Big5 and supersets

All of those except Big5 and supersets are now supported, so short of
any change, my position is that right now we're discussing the "last"
significant addition to musl's iconv.
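
As a rough sketch only (this is not code from musl), a program could
probe at run time whether the families listed above are covered by
whatever iconv it is linked against. The helper names and the charset
name strings are illustrative assumptions; the spellings an
implementation accepts vary.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if this libc's iconv can convert from `charset` to UTF-8,
 * 0 otherwise. Hypothetical helper, not part of any libc. */
int iconv_supported(const char *charset)
{
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}

/* Prints one line per name and returns how many were supported.
 * The caller supplies the names; a list such as
 * { "ISO-8859-1", "KOI8-R", "EUC-JP", "EUC-KR", "GB18030", "BIG5" }
 * is only an assumed example of one name per family. */
size_t report_support(const char *const *names, size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        int s = iconv_supported(names[i]);
        printf("%-12s %s\n", names[i], s ? "yes" : "no");
        ok += (size_t)s;
    }
    return ok;
}
```

A check like this only says that a converter can be opened; it says
nothing about how complete or correct the underlying table is.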

Some things that are definitely outside the scope of musl's iconv:

- Anything whose characters are not present in Unicode
- Anything PUA-based (really, same as above)
- Newly invented encodings with no historical encoded data

What's more borderline is where UAO falls: encodings that have neither
governmental nor language-body-authority support, nor support from any
other software vendor, but for which there is at least one major
corpus of historical data and/or current usage of the encoding by
users of the language(s) whose characters are encoded.

However, based on the file at

http://moztw.org/docs/big5/table/uao250-b2u.txt

a number of the mappings UAO defines are into the private use area.
This would generally preclude support (as this is a font-specific
encoding, not a Unicode encoding) unless the affected characters have
since been added to Unicode and could be remapped to the correct
codepoints. Do you know the status on this?

I'm also still unclear on whether this is a superset of HKSCS (as it
stands it definitely is not, but maybe it would be if the PUA mappings
were corrected; I did not do any detailed checks but just noted the
lack of mappings to the non-BMP codepoints HKSCS uses).
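
The two checks mentioned above (how many mappings land in the private
use area, and whether any land outside the BMP) could be automated
roughly as follows. This is a sketch under assumptions: the function
and struct names are made up for illustration, and the table is
ASSUMED to hold one "0xXXXX 0xYYYY" pair per line (Big5 code, then
Unicode codepoint); the actual layout of the file should be verified
before relying on it.

```c
#include <stdint.h>
#include <stdio.h>

/* 1 if cp lies in a Unicode private use area: U+E000..U+F8FF in the
 * BMP, or the private use ranges of planes 15 and 16. */
int is_pua(uint32_t cp)
{
    return (cp >= 0xE000 && cp <= 0xF8FF)
        || (cp >= 0xF0000 && cp <= 0xFFFFD)
        || (cp >= 0x100000 && cp <= 0x10FFFD);
}

/* 1 if cp is a valid codepoint outside the Basic Multilingual Plane. */
int is_non_bmp(uint32_t cp)
{
    return cp > 0xFFFF && cp <= 0x10FFFF;
}

struct map_stats { unsigned long total, pua, non_bmp; };

/* Classifies one table line and updates *st. Returns 1 if the line
 * parsed as two hex numbers (assumed format), 0 otherwise, so that
 * comment or blank lines are skipped. The two counters are
 * independent: a plane 15/16 PUA target counts in both. */
int classify_line(const char *line, struct map_stats *st)
{
    unsigned int big5, ucs;
    if (sscanf(line, "%x %x", &big5, &ucs) != 2)
        return 0;
    st->total++;
    if (is_pua(ucs))
        st->pua++;
    if (is_non_bmp(ucs))
        st->non_bmp++;
    return 1;
}
```

Feeding each line of the table through classify_line (for example
from an fgets loop) gives the counts; a nonzero pua count alone does
not say whether those characters have since been added to Unicode.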

> >Even with all the
> >above, the size of iconv.o will be under 130k, maybe closer to 110k.
> >If you actually use iconv in your program, this is a small price to
> >pay for having it fully functional. On the other hand, if linking it
> >is conditional, you have to consider who makes the decision, and when.
> >If it's at link time for each application, that's probably too much of
> >a musl-specific version.
> 
> Since statically linking libc-iconv is new area now (other libc
> doesn't touch this topic much), I think we can create standard for
> statically linking specified encoding table in link time.
> (This is also a reason of "why libc should provide an unique
> identifier with preprocessor define")

I don't see how "creating a standard" for doing this would make the
situation any better. Most software authors these days are at best
tolerant of the existence of static linking, and more often hostile to
it. They're not going to add specific build behavior for static
linking, and even if they do, they're likely to get it wrong, in which
case the user ends up stuck with binaries that can't process input in
their language.

Rich
