musl - Re: iconv Korean and Traditional Chinese research so far

Message-ID: <20130805033955.GC221@brightrain.aerifal.cx>
Date: Sun, 4 Aug 2013 23:39:56 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far

On Mon, Aug 05, 2013 at 03:53:12AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> 04-08-2013 20:49 Rich Felker <dalias@...ifal.cx>:
> 
> > Do I want to add that size? No, of course not, and that's why
> > I'm hoping (but not optimistic) that there may be a way to
> > elide a good part of the table based on patterns in the Hangul
> > syllables or the possibility that the giant extensions are
> > unimportant.
> 
> I think there is a way for easy configuration. See other mails,
> they clarify what my intention is.

I saw, and you're free to write such an iconv implementation if you
like, but it's not right for musl. Inventing elaborate mechanisms to
solve simple problems is the glibc way of doing things, not the musl
way.

iconv is not something that needs to be extensible. There is a finite
set of legacy encodings that's relevant to the world, and their
relevance is going to go down and down with time, not up.

> > Do I want to give users who have large volumes of legacy text
> > in their languages stored in these encodings the same respect
> > and dignity as users of other legacy encodings we already
> > support? Yes.
> 
> Of course. I won't dictate to others which conversions they want
> to use. I only hate to have plenty of conversion tables on my
> system when I really know I never use that kind of conversion.

And your table for just Chinese is as large as all our tables
combined...

I agree you can make iconv smaller than musl's in the case where _no_
legacy DBCS are installed. But if you have just one, you'll be just as
large or larger than musl with them all. Just compare the size of
musl's tables to glibc's converters. I've worked hard to make them as
small as reasonably possible without doing hideous hacks like
decompression into an in-memory buffer, which would actually increase
bloat.
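
To illustrate what "tables" means here, a minimal sketch of a lookup
that lives entirely in read-only data and is used in place, with no
decompression step and no file access. This is not the code from
musl's iconv.c; the names hangul_row_b0 and euckr_decode_pair are
hypothetical, and the table is only a four-entry excerpt of EUC-KR
row 0xB0, assumed here purely for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative excerpt only: the first four cells of EUC-KR row 0xB0
 * (KS X 1001 Hangul), stored as 16-bit code points in read-only data.
 * A real table would cover every row of the character set. */
static const uint16_t hangul_row_b0[] = {
    0xAC00, /* bytes 0xB0 0xA1 */
    0xAC01, /* bytes 0xB0 0xA2 */
    0xAC04, /* bytes 0xB0 0xA3 */
    0xAC07, /* bytes 0xB0 0xA4 */
};

/* Map a lead/trail byte pair to a Unicode code point, or return 0 if
 * the pair falls outside this excerpt. The table is indexed directly;
 * nothing is allocated and nothing is read from the filesystem. */
uint32_t euckr_decode_pair(unsigned char lead, unsigned char trail)
{
    size_t n = sizeof hangul_row_b0 / sizeof hangul_row_b0[0];
    if (lead != 0xB0 || trail < 0xA1)
        return 0;
    size_t i = (size_t)(trail - 0xA1);
    return i < n ? hangul_row_b0[i] : 0;
}
```

In this sketch, euckr_decode_pair(0xB0, 0xA1) returns 0xAC00, and any
pair outside the excerpt returns 0. The cost of supporting a charset
this way is the size of the const array plus a few lines of indexing
code.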

> ... but
> in case I really need, it can be added dynamically to the running
> system.

If you have root or want to set up nonstandard environment variables.

> > This issue was discussed a long time ago and the consensus
> > among users of static linking was that static linking is most
> > valuable when it makes the binary completely "portable" to
> > arbitrary Linux systems for the same cpu arch, without any
> > dependency on having files in particular locations on the
> > system aside from the minimum required by POSIX (things
> > like /dev/null), the standard Linux /proc mountpoint, and
> > universal config files like /etc/resolv.conf (even that is not
> > necessary, BTW, if you have a DNS on localhost). Having iconv
> > not work without external character tables is essentially a
> > form of dynamic linking, and carries with it issues like where
> > the files are to be found (you can override that with an
> > environment variable, but that can't be permitted for setuid
> > binaries), what happens if the format needs to change and the
> > format on the target machine is not compatible with the libc
> > version your binary was built with, etc. This is also the main
> > reason musl does not support something like nss.
> 
> I see the topic of self contained linking, and you are right that
> it is required, but it is fully possible to have the best of both
> worlds without much overhead. Writing iconv as a virtual machine

It's not the best of both worlds. It's essentially the same as dynamic
linking.

> interpreter allows statically linking in the conversion byte code
> programs.

At several times the size of the current code/tables, and after the
user searches through the documentation to figure out how to do it.

> > Another side benefit of the current implementation is that it's
> > fully self-contained and independent of any system facilities.
> > It's pure C and can be taken out of musl and dropped in to any
> > program on any C implementation, including freestanding
> > (non-hosted) implementations. If it depended on the filesystem,
> > adapting it for such usage would be a lot more work.
> 
> The virtual machine shall be written in C, I've done such type of
> programming many times. So resulting code will compile with any C
> compiler, and byte code programs are just arrays of bytes,
> independent of machine byte order. So you will not have any further
> dependencies.

It's not just a matter of dropping in. You'd have path searches to
modify or disable, build options to get the static tables turned on,
and all of this stuff would have to be integrated with the build
system for what you're dropping it into.

Complexity is never the solution. Honestly, I would take a 1 MB
increase in binary size over this kind of complexity any day.
Thankfully, we don't have to make such a tradeoff.

> > A fsm implementation would be several times larger than the
> > implementations in iconv.c.
> 
> A bit larger, yes ... but not so much, if the virtual machine gets
> designed carefully, and it will not increase in size when more
> charsets get added (only the size of the byte code program is
> added).

Charsets are not added. The time of charsets is over. It should have
been over in 1992, when Pike and Thompson made them obsolete, but it's
really over now.

> > It's possible that we could, at some time in the future,
> > support loading of user-defined character conversion files as
> > an added feature, but this should only be for really
> > special-purpose things like custom encodings used for games or
> > obsolete systems (old Mac, console games, IBM mainframes, etc.).
> 
> We can have it all, with not much overhead. And it is not only
> for such special cases. I don't like to install musl on my
> systems with Japanese, Chinese or Korean conversions, but in case
> I really need, I'm able to throw them in, without much work.
> 
> .... and we can add every character conversion on the fly, without
> rebuild of the library.

Maybe we should also include a bytecode interpreter for doing hostname
lookups, since you might want to do something other than DNS or a
hosts file. And a bytecode interpreter for user database lookups in
place of passwd files. And a bytecode interpreter for adding new
crypt() algorithms. And...

> > In terms of the criteria for what to include in musl itself, my
> > idea is that if you have a mail client or web browser based on
> > iconv for its character set handling, you should be able to
> > read the bulk of content in any language.
> 
> That holds if you are building a mail client or web browser. But
> what if you want to include the possibility of charset conversion
> and stay at a small size, including only the system-relevant
> conversions, but not being limited to those? Any other conversion
> can then be added on the fly.

Then dynamically link it. If you want an extensible binary, you use
dynamic linking. The main reason for static linking is when you want a
binary whose behavior does not change with the runtime environment --
for example, for security purposes, for carrying around to other
machines that don't have the same runtime environment, etc.

Rich