musl - Re: iconv Korean and Traditional Chinese research so far

Message-ID: <20130805004915.GA221@brightrain.aerifal.cx>
Date: Sun, 4 Aug 2013 20:49:15 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far

On Mon, Aug 05, 2013 at 12:39:43AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> > Worst-case, adding Korean and Traditional Chinese tables will
> > roughly double the size of iconv.o to around 150k. This will
> > noticeably enlarge libc.so, but will make no difference to
> > static-linked programs except those using iconv. I'm hoping we
> > can make these additions less expensive, but I don't see a good
> > way yet.
> 
> Oh nooo, do you really want to add this statically to the iconv
> version?

Do I want to add that size? No, of course not, and that's why I'm
hoping (but not optimistic) that there may be a way to elide a good
part of the table based on patterns in the Hangul syllables or the
possibility that the giant extensions are unimportant.
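
As an illustration of the kind of regularity meant by "patterns in the
Hangul syllables": on the Unicode side, the 11172 precomposed syllables
are laid out arithmetically from their leading consonant, vowel and
trailing consonant indices. The sketch below shows only that arithmetic.
The function names are hypothetical, and whether the legacy Korean
tables can exploit this layout is exactly the open question, so this is
not a description of anything in iconv.c.

```c
/* Constants of the Unicode Hangul Syllables block (U+AC00..U+D7A3). */
#define S_BASE  0xAC00u
#define L_COUNT 19u   /* leading consonants */
#define V_COUNT 21u   /* vowels */
#define T_COUNT 28u   /* trailing consonants, index 0 = none */
#define S_COUNT (L_COUNT * V_COUNT * T_COUNT)   /* 11172 */

/* Hypothetical helper: build a syllable code point from jamo indices.
 * The caller must pass l < 19, v < 21, t < 28. */
static unsigned hangul_compose(unsigned l, unsigned v, unsigned t)
{
	return S_BASE + (l * V_COUNT + v) * T_COUNT + t;
}

/* Hypothetical helper: split a syllable code point into jamo indices.
 * Returns 0 on success, -1 if c is outside the syllable block. */
static int hangul_decompose(unsigned c, unsigned *l, unsigned *v, unsigned *t)
{
	if (c < S_BASE || c >= S_BASE + S_COUNT) return -1;
	c -= S_BASE;
	*l = c / (V_COUNT * T_COUNT);
	*v = c % (V_COUNT * T_COUNT) / T_COUNT;
	*t = c % T_COUNT;
	return 0;
}
```

For example, indices (18, 0, 4) compose to U+D55C, and the last
syllable, (18, 20, 27), is U+D7A3. No table is needed for this half of
the mapping; any saving depends on how regularly the legacy encoding
orders the same syllables.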

Do I want to give users who have large volumes of legacy text in their
languages stored in these encodings the same respect and dignity as
users of other legacy encodings we already support? Yes.

> Why can't we have all these character conversions on a state driven
> machine which loads its information from an external configuration
> file? This way we can have any kind of conversion someone likes,
> by just adding the configuration file for the required Unicode to
> X and X to Unicode conversions.

This issue was discussed a long time ago and the consensus among users
of static linking was that static linking is most valuable when it
makes the binary completely "portable" to arbitrary Linux systems for
the same cpu arch, without any dependency on having files in
particular locations on the system aside from the minimum required by
POSIX (things like /dev/null), the standard Linux /proc mountpoint,
and universal config files like /etc/resolv.conf (even that is not
necessary, BTW, if you have a DNS server on localhost). Having iconv not work
without external character tables is essentially a form of dynamic
linking, and carries with it issues like where the files are to be
found (you can override that with an environment variable, but that
can't be permitted for setuid binaries), what happens if the format
needs to change and the format on the target machine is not compatible
with the libc version your binary was built with, etc. This is also
the main reason musl does not support something like nss.

Another side benefit of the current implementation is that it's fully
self-contained and independent of any system facilities. It's pure C
and can be taken out of musl and dropped into any program on any C
implementation, including freestanding (non-hosted) implementations.
If it depended on the filesystem, adapting it for such usage would be
a lot more work.

> State driven fsm interpreters are really small and fast and may
> read its complete configuration from a file ... architecture
> independent file, so we may have same character conversion files
> for all architectures.

An FSM implementation would be several times larger than the
implementations in iconv.c. It's possible that we could, at some time
in the future, support loading of user-defined character conversion
files as an added feature, but this should only be for really
special-purpose things like custom encodings used for games or
obsolete systems (old Mac, console games, IBM mainframes, etc.).

In terms of the criteria for what to include in musl itself, my idea
is that if you have a mail client or web browser based on iconv for
its character set handling, you should be able to read the bulk of
content in any language.
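
For concreteness, this is roughly what such a client does with the
standard POSIX iconv interface. It is a minimal sketch: to_utf8 is a
hypothetical helper name, the whole input is assumed to fit in one
call, and error handling is reduced to a single failure value.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of `in`, encoded in the
 * charset named by `from`, into UTF-8 in `out` (capacity outlen).
 * Returns the number of bytes written, or (size_t)-1 on any failure,
 * including an unsupported charset or an output buffer that is too
 * small. */
size_t to_utf8(const char *from, const char *in, size_t inlen,
               char *out, size_t outlen)
{
	iconv_t cd = iconv_open("UTF-8", from);
	if (cd == (iconv_t)-1) return (size_t)-1;

	/* iconv() takes char ** for the input, so the const is cast away;
	 * the input buffer itself is not modified. */
	char *ip = (char *)in;
	char *op = out;
	size_t left = outlen;

	size_t r = iconv(cd, &ip, &inlen, &op, &left);
	iconv_close(cd);

	if (r == (size_t)-1) return (size_t)-1;
	return outlen - left;
}
```

A mail client would pass the charset named in the message headers as
`from`, for example "EUC-KR". If libc has no table for that charset,
iconv_open fails and the message cannot be displayed, which is the
situation the inclusion criteria above are meant to avoid.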

Rich