musl - Re: iconv Korean and Traditional Chinese research so far

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130805032452.280127fd@ralda.gmx.de>
Date: Mon, 5 Aug 2013 03:24:52 +0200
From: Harald Becker <ralda@....de>
Cc: musl@...ts.openwall.com, nsz@...t70.net
Subject: Re: iconv Korean and Traditional Chinese research so far

Hi !

05-08-2013 02:44 Szabolcs Nagy <nsz@...t70.net>:

> * Harald Becker <ralda@....de> [2013-08-05 00:39:43 +0200]:
> > Why cant we have all this character conversions on a state
> > driven machine which loads its information from a external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
> 
> external files provided by libc can work but they
> should be possible to embed into the binary

As far as I know, does glibc create small dynamically linked
objects and load those when required. This is architecture
specific. So you always need conversion files which correspond
to your C library.

My intention is to write conversion as a machine independent byte
code, which may be copied between machines of different
architecture. You need a charset conversion, just add the charset
bytecode to the conversion directory, which may be configurable
(directory name from environ variable with default fallback). May
even be a search path for conversion files, so conversion files
may be installed in different locations.

> otherwise a static binary is not self-contained
> and you have to move parts of the libc around
> along with the binary and if they are loaded
> from fixed path then it does not work at all
> (permissions, conflicting versions etc)

Ok, I see the static linking topic, but this is no problem with
byte code conversion programs. It can easily be added: Just add
all the conversion byte code programs together to a single big
array, with a name and offset table ahead, then link it into your
program.

May be done in two steps:

1) Create a selection file for musl build, and include the
specified charsets in libc.a/.so

2) Select the required charset files and create an .o file to
link into your program.


iconv then shall:
- look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
- search table of with libc linked charsets
- search table of with the program linked charsets
- search for charset on external search path

... or do in opposite direction and use first charset
conversion found.

This lookup is usually very small, except file system search, so
it shall not produce much overhead / bloat.

[Addendum after thinking a bit more: The byte code conversion
files shall exist of a small statical header, followed by the
byte code program. The header shall contain the charset name,
version of required virtual machine and length of byte code. So
you need only add all such conversion files to a big array of
bytes and add a Null header to mark the end of table. Then you
only need the start of the array and you are able to search
through for a specific charset. The iconv function in libc
contains a definition for an "unsigned char const
*iconv_user_charsets = NULL;", which is linked in, when the user
does not provide it's own definition. So iconv can search all
linked in charset definitions, and need no code changes. Really
simple configuration to select charsets to build in.]

> if the format changes then dynamic linking is
> problematic as well: you cannot update libc
> in a single atomic operation

The byte code shall be independent of dynamic linking. The
conversion files are only streams of bytes, which shall also be
architecture independent. So you do only need to update the
conversion files if the virtual machine definition of iconv has
been changed (shall not be done much). External files may be read
into malloc-ed buffers or mmap-ed, not linked in by the
dynamical linker.

--
Harald
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.