musl - Re: iconv Korean and Traditional Chinese research so far

Message-ID: <20130805035312.5d874012@ralda.gmx.de>
Date: Mon, 5 Aug 2013 03:53:12 +0200
From: Harald Becker <ralda@....de>
Cc: musl@...ts.openwall.com, dalias@...ifal.cx
Subject: Re: iconv Korean and Traditional Chinese research so far

Hi Rich !

04-08-2013 20:49 Rich Felker <dalias@...ifal.cx>:

> Do I want to add that size? No, of course not, and that's why
> I'm hoping (but not optimistic) that there may be a way to
> elide a good part of the table based on patterns in the Hangul
> syllables or the possibility that the giant extensions are
> unimportant.

I think there is a way for easy configuration. See other mails,
they clarify what my intention is.

> Do I want to give users who have large volumes of legacy text
> in their languages stored in these encodings the same respect
> and dignity as users of other legacy encodings we already
> support? Yes.

Of course. I won't dictate to others which conversions they want
to use. I only hate to have plenty of conversion tables on my
system when I know I will never use such conversions ... but in
case I really need one, it can be added dynamically to the
running system.


> > Why cant we have all this character conversions on a state
> > driven machine which loads its information from a external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
> 
> This issue was discussed a long time ago and the consensus
> among users of static linking was that static linking is most
> valuable when it makes the binary completely "portable" to
> arbitrary Linux systems for the same cpu arch, without any
> dependency on having files in particular locations on the
> system aside from the minimum required by POSIX (things
> like /dev/null), the standard Linux /proc mountpoint, and
> universal config files like /etc/resolv.conf (even that is not
> necessary, BTW, if you have a DNS on localhost). Having iconv
> not work without external character tables is essentially a
> form of dynamic linking, and carries with it issues like where
> the files are to be found (you can override that with an
> environment variable, but that can't be permitted for setuid
> binaries), what happens if the format needs to change and the
> format on the target machine is not compatible with the libc
> version your binary was built with, etc. This is also the main
> reason musl does not support something like nss.

I see the point about self-contained linking, and you are right
that it is required, but it is fully possible to have the best of
both worlds without much overhead. Writing iconv as a virtual
machine interpreter allows the conversion byte code programs to
be linked in statically. Those that are not linked in can be
searched for in the filesystem. And a simple configuration option
may disable the filesystem search completely, for really small
embedded operation. Beside this, all conversions are the same and
may be freely copied between architectures, or linked statically
into a user program (just put the byte stream of the selected
charsets into a simple C array of bytes).
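
A minimal sketch of that lookup order, under assumptions: built-in
byte code is searched first, and the filesystem is tried only when
a compile-time option allows it. Every name here (find_charset,
struct charset_prog, CHARSET_FILE_SEARCH, CHARSET_DIR) and the
placeholder byte code are hypothetical and not part of musl.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical option: build with -DCHARSET_FILE_SEARCH=0 to
 * compile out the filesystem search for small embedded systems. */
#ifndef CHARSET_FILE_SEARCH
#define CHARSET_FILE_SEARCH 1
#endif
#define CHARSET_DIR "/usr/share/charsets" /* hypothetical location */

struct charset_prog {
    const char *name;
    const unsigned char *code;
    size_t len;
};

/* Placeholder byte code; the format is an assumption of this sketch. */
static const unsigned char prog_latin1[] = { 0x01, 0x00, 0xff, 0x00, 0x00, 0x00 };

/* Statically linked programs: adding a charset adds only data here. */
static const struct charset_prog builtin[] = {
    { "ISO-8859-1", prog_latin1, sizeof prog_latin1 },
};

/* Returns the program length and sets *code, or returns 0 if the
 * charset is not found. buf/cap is caller-provided storage for a
 * program loaded from a file. Names are matched exactly here; a
 * real iconv would compare them case-insensitively. */
size_t find_charset(const char *name, unsigned char *buf, size_t cap,
                    const unsigned char **code)
{
    size_t i;
    assert(name && code);
    for (i = 0; i < sizeof builtin / sizeof builtin[0]; i++) {
        if (!strcmp(name, builtin[i].name)) {
            *code = builtin[i].code;
            return builtin[i].len;
        }
    }
#if CHARSET_FILE_SEARCH
    /* Reject names containing '/' so a lookup cannot leave CHARSET_DIR. */
    if (buf && cap && !strchr(name, '/')) {
        char path[256];
        FILE *f;
        size_t n;
        int more;
        int r = snprintf(path, sizeof path, "%s/%s", CHARSET_DIR, name);
        if (r < 0 || (size_t)r >= sizeof path) return 0;
        f = fopen(path, "rb");
        if (!f) return 0;
        n = fread(buf, 1, cap, f);
        more = (fgetc(f) != EOF); /* file larger than buf: reject */
        fclose(f);
        if (n && !more) {
            *code = buf;
            return n;
        }
    }
#else
    (void)buf;
    (void)cap;
#endif
    return 0;
}
```

A statically linked binary that must not touch the filesystem would
be built with the file search compiled out, so only the entries in
builtin[] are ever consulted.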

> Another side benefit of the current implementation is that it's
> fully self-contained and independent of any system facilities.
> It's pure C and can be taken out of musl and dropped in to any
> program on any C implementation, including freestanding
> (non-hosted) implementations. If it depended on the filesystem,
> adapting it for such usage would be a lot more work.

The virtual machine shall be written in C; I've done this type of
programming many times. So the resulting code will compile with
any C compiler, and the byte code programs are just arrays of
bytes, independent of machine byte order. So you will not have
any further dependencies.
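
To illustrate the byte order point, here is a small sketch of such
an interpreter for single-byte charsets. The opcodes (OP_END,
OP_RANGE, OP_TABLE), the function vm_decode and the byte code
layout are invented for this example; they are assumptions, not an
existing or proposed musl format. Multi-byte values are stored high
byte first and read byte by byte, so the same array works on any
host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes:
 *   OP_END                      no mapping found
 *   OP_RANGE lo hi base(16)     bytes lo..hi map to base + (b - lo)
 *   OP_TABLE lo hi entry(16)... one 16-bit entry per byte in lo..hi */
enum { OP_END = 0, OP_RANGE = 1, OP_TABLE = 2 };

/* Read a 16-bit value stored high byte first, whatever the host
 * byte order is. */
static unsigned get16(const unsigned char *p)
{
    return (unsigned)p[0] << 8 | p[1];
}

/* Map input byte b to a Unicode code point by running the program.
 * Returns -1 if the program has no mapping or is malformed. */
long vm_decode(const unsigned char *code, size_t len, unsigned char b)
{
    size_t pc = 0;
    assert(code);
    while (pc < len) {
        unsigned char op = code[pc];
        if (op == OP_RANGE) {
            unsigned char lo, hi;
            if (len - pc < 5) return -1;
            lo = code[pc + 1];
            hi = code[pc + 2];
            if (b >= lo && b <= hi)
                return (long)get16(code + pc + 3) + (b - lo);
            pc += 5;
        } else if (op == OP_TABLE) {
            unsigned char lo, hi;
            size_t n;
            if (len - pc < 3) return -1;
            lo = code[pc + 1];
            hi = code[pc + 2];
            if (hi < lo) return -1;
            n = (size_t)(hi - lo) + 1;
            if (len - pc - 3 < 2 * n) return -1;
            if (b >= lo && b <= hi)
                return (long)get16(code + pc + 3 + 2 * (size_t)(b - lo));
            pc += 3 + 2 * n;
        } else {
            return -1; /* OP_END or unknown opcode */
        }
    }
    return -1;
}

/* Incomplete demo fragment of ISO-8859-15: 0x00..0xA3 map to the
 * same code points, 0xA4 is the euro sign U+20AC, 0xA5 is U+00A5. */
const unsigned char demo_prog[] = {
    OP_RANGE, 0x00, 0xA3, 0x00, 0x00,
    OP_TABLE, 0xA4, 0xA4, 0x20, 0xAC,
    OP_RANGE, 0xA5, 0xA5, 0x00, 0xA5,
    OP_END
};
```

With this fragment, vm_decode(demo_prog, sizeof demo_prog, 0xA4)
yields 0x20AC, and a byte the fragment does not cover yields -1.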

> A fsm implementation would be several times larger than the
> implementations in iconv.c.

A bit larger, yes ... but not by much, if the virtual machine is
designed carefully, and the interpreter will not increase in size
when more charsets get added (only the size of each byte code
program is added).


> It's possible that we could, at some time in the future,
> support loading of user-defined character conversion files as
> an added feature, but this should only be for really
> special-purpose things like custom encodings used for games or
> obsolete systems (old Mac, console games, IBM mainframes, etc.).

We can have it all, with not much overhead. And it is not only
for such special cases. I don't want to install musl on my
systems with Japanese, Chinese or Korean conversions, but in case
I really need them, I'm able to throw them in without much work.

... and we can add every character conversion on the fly, without
rebuilding the library.

> In terms of the criteria for what to include in musl itself, my
> idea is that if you have a mail client or web browser based on
> iconv for its character set handling, you should be able to
> read the bulk of content in any language.

That holds if you are building a mail client or web browser. But
what if you want to offer charset conversion while staying small,
including only the conversions relevant to your system, yet
without being limited to those? Any other conversion can then be
added on the fly.

--
Harald