musl - Re: iconv Korean and Traditional Chinese research so far

Message-ID: <20130805090343.6d2f9f00@ralda.gmx.de>
Date: Mon, 5 Aug 2013 09:03:43 +0200
From: Harald Becker <ralda@....de>
Cc: musl@...ts.openwall.com, nsz@...t70.net
Subject: Re: iconv Korean and Traditional Chinese research so far

Hi !

05-08-2013 05:13 Szabolcs Nagy <nsz@...t70.net>:

> * Harald Becker <ralda@....de> [2013-08-05 03:24:52 +0200]:
> > iconv then shall:
> > - look for some fixed charsets like ASCII, Latin-1, UTF-8,
> > etc.
> > - search table of with libc linked charsets
> > - search table of with the program linked charsets
> > - search for charset on external search path
> 
> sounds like a lot of extra management cost
> (for libc, application writer and user as well)

This is not much work. You already have to search for the
character set table to use; that is, you need to search at least
a table of string values to find the pointer to the conversion
table. Searching a table, in my statement above, just means
walking a pointer chain and doing string compares until a
matching character set is found. That is not much different from
the code that is required anyway. Doing this a second time to
check a possible user chain is just one more helper function
call.
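
A minimal sketch of that chain walk, not actual musl code. The
struct charset_entry type and the names find_in_chain and
find_charset are hypothetical, and the plain strcmp stands in for
whatever name matching iconv would really use:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chain entry: a charset name plus its conversion data. */
struct charset_entry {
    const char *name;                  /* e.g. "KOI8-R" */
    const void *table;                 /* conversion table or byte code */
    const struct charset_entry *next;  /* NULL marks the end of the chain */
};

/* Walk one chain and return the first entry whose name matches. */
static const struct charset_entry *
find_in_chain(const struct charset_entry *e, const char *name)
{
    assert(name != NULL);
    for (; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Check the charsets linked with libc first, then the chain linked
 * with the program. The second lookup is the one extra helper call. */
static const struct charset_entry *
find_charset(const struct charset_entry *libc_chain,
             const struct charset_entry *user_chain,
             const char *name)
{
    const struct charset_entry *e = find_in_chain(libc_chain, name);
    return e != NULL ? e : find_in_chain(user_chain, name);
}
```

A NULL result would be the point where the external search path
is tried.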

The only code that grows a bit is the file system search. How
much depends on whether we try only a single location or walk
through a search path list. But this is the cost of the
flexibility to load character set conversions dynamically (which
I would really prefer for seldom used charsets).

... and for the application writer it is only more work if he
wants to add charset tables to his program which are not in the
static libc.

The problem is that all tables in libc need to be linked into
your program if you use iconv. So each added charset conversion
increases the size of your program ... and I definitely won't
include Japanese, Chinese or Korean charsets in my programs. Not
that I ignore those people's needs; I just won't need them, so I
don't want to add those conversions to the programs sitting on my
disk.

> it would be nice if the compiler could figure out
> at build time (eg with lto) which tables are used
> but i guess charsets often only known at runtime

How do you want to do this? And how shall the compiler know which
charsets the user may use during operation? So the only way to
select the charset tables to include in your program is to
assume ahead of time which tables might be used. That is part of
the configuration of the musl build or of the application
program build.

> > [Addendum after thinking a bit more: The byte code conversion
> > files shall exist of a small statical header, followed by the
> > byte code program. The header shall contain the charset name,
> > version of required virtual machine and length of byte code.
> > So you need only add all such conversion files to a big array
> > of bytes and add a Null header to mark the end of table. Then
> > you only need the start of the array and you are able to
> > search through for a specific charset. The iconv function in
> > libc contains a definition for an "unsigned char const
> > *iconv_user_charsets = NULL;", which is linked in, when the
> > user does not provide it's own definition. So iconv can
> > search all linked in charset definitions, and need no code
> > changes. Really simple configuration to select charsets to
> > build in.]
> > 
> 
> yes that can work, but it's a musl specific hack
> that the application programmer need to take care of

Only if the application programmer wants to add a charset which
is not in libc to the statically built program does some extra
work have to be done. That gives some more flexibility. If you
don't care, you get the list of charsets built into musl.
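
To make the byte array idea concrete, here is a rough sketch of
the search. Everything about the layout is an assumption for
illustration only: a 30 byte header with a NUL-terminated name in
bytes 0..23, the required VM version in bytes 24..25 and the byte
code length in bytes 26..29 (both big-endian), and a header with
an empty name as the end marker. The names cs_find and struct
cs_info are hypothetical as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed header layout, for illustration only. */
enum { CS_NAME_LEN = 24, CS_HDR_LEN = 30 };

struct cs_info {
    const unsigned char *code;  /* start of the byte code */
    unsigned long len;          /* length of the byte code in bytes */
    unsigned vm_version;        /* VM version the file requires */
};

/* Search a concatenation of conversion files for a charset name.
 * Returns 0 and fills *out on success, -1 if the name is not found.
 * A NULL blob means no user charsets are linked in. */
static int cs_find(const unsigned char *blob, const char *name,
                   struct cs_info *out)
{
    assert(name != NULL && out != NULL);
    if (blob == NULL)
        return -1;
    while (blob[0] != 0) {  /* an empty name marks the end of the table */
        unsigned version = (unsigned)blob[24] << 8 | blob[25];
        unsigned long len = (unsigned long)blob[26] << 24
                          | (unsigned long)blob[27] << 16
                          | (unsigned long)blob[28] << 8
                          | (unsigned long)blob[29];
        if (strncmp((const char *)blob, name, CS_NAME_LEN) == 0) {
            out->code = blob + CS_HDR_LEN;
            out->len = len;
            out->vm_version = version;
            return 0;
        }
        blob += CS_HDR_LEN + len;  /* skip to the next header */
    }
    return -1;
}
```

The fields are read byte by byte so the same file works on any
architecture. iconv would call something like this once with the
libc array and once with iconv_user_charsets.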
 
> > > if the format changes then dynamic linking is
> > > problematic as well: you cannot update libc
> > > in a single atomic operation
> > 
> > The byte code shall be independent of dynamic linking. The
> > conversion files are only streams of bytes, which shall also
> > be architecture independent. So you do only need to update the
> > conversion files if the virtual machine definition of iconv
> > has been changed (shall not be done much). External files may
> > be read into malloc-ed buffers or mmap-ed, not linked in by
> > the dynamical linker.
> > 
> 
> that does not solve the format change problem
> you cannot update libc without race
> (unless you first replace the .so which supports
> the old format as well as the new one, but then
> libc has to support all previous formats)

If the definition of the iconv virtual state machine is modified,
you need to take extra care on update (delete old charset files,
install new lib, install new charset files, restart system) ...
but this is only required on a major update. As soon as the
virtual machine definition has stabilized you do not need to
change charset definition files; you just update your lib, then
update any new charset files. After an initial phase of testing
it should happen relatively seldom that the virtual machine
definition needs to be changed in an incompatible manner. And
simply extending the virtual machine does not invalidate the old
charset files.

> it's probably easy to design a fixed format to
> avoid this

A fixed format? For what? Do you know the differences between
charsets, especially multibyte charsets?

> it seems somewhat similar to the timezone problem
> except zoneinfo is maintained outside of libc so
> there is not much choice, but there are the same
> issues: updating it should be done carefully,
> setuid programs must be handled specially etc

Again: as soon as the virtual machine definition has reached a
stable state, it should rarely happen that a change invalidates
a charset definition file. That is, at least old files will
continue to work with newer lib versions. So there is no problem
on update: just update your lib, then update your charset files.
The only problem arises if a still running application uses a
new charset file with an old version of the lib. This will be
detected and leads to a failure code from iconv. So you need to
restart your application ... which is always a good decision
after you have updated your lib.

--
Harald