Date: Mon, 5 Aug 2013 03:24:52 +0200
From: Harald Becker <ralda@....de>
Cc: musl@...ts.openwall.com, nsz@...t70.net
Subject: Re: iconv Korean and Traditional Chinese research so far

Hi !

05-08-2013 02:44 Szabolcs Nagy <nsz@...t70.net>:

> * Harald Becker <ralda@....de> [2013-08-05 00:39:43 +0200]:
> > Why cant we have all this character conversions on a state
> > driven machine which loads its information from a external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
>
> external files provided by libc can work but they
> should be possible to embed into the binary

As far as I know, glibc creates small dynamically linked objects
and loads them when required. This is architecture specific, so
you always need conversion files which correspond to your C
library.

My intention is to write the conversions as machine independent
byte code, which may be copied between machines of different
architectures. If you need a charset conversion, just add the
charset byte code to the conversion directory, which may be
configurable (directory name from an environment variable, with a
default fallback). There may even be a search path for conversion
files, so that they can be installed in different locations.

> otherwise a static binary is not self-contained
> and you have to move parts of the libc around
> along with the binary and if they are loaded
> from fixed path then it does not work at all
> (permissions, conflicting versions etc)

Ok, I see the static linking issue, but this is no problem with
byte code conversion programs. It can easily be added: just
concatenate all the conversion byte code programs into a single
big array, with a name and offset table ahead, then link it into
your program.
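
A rough C sketch of such a name and offset table, only to make the
idea concrete. The struct layout, the names charset_entry,
charset_blob, charset_table and find_embedded, and the "X-DEMO-*"
charsets with their placeholder bytes are all assumptions made for
illustration; nothing here is an existing musl interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry of the name/offset table placed ahead of the
 * concatenated byte code programs. */
struct charset_entry {
    const char *name;   /* charset name */
    size_t offset;      /* start of its byte code inside the blob */
    size_t length;      /* length of its byte code in bytes */
};

/* All byte code programs appended to one array (placeholder bytes,
 * not real conversion programs). */
static const unsigned char charset_blob[] = {
    0x01, 0x02, 0x03,   /* "X-DEMO-A" */
    0x0a, 0x0b          /* "X-DEMO-B" */
};

/* Table terminated by an entry with a NULL name. */
static const struct charset_entry charset_table[] = {
    { "X-DEMO-A", 0, 3 },
    { "X-DEMO-B", 3, 2 },
    { NULL, 0, 0 }
};

/* Return a pointer to the byte code for `name` and store its length
 * in *len, or return NULL if the charset is not embedded. */
const unsigned char *find_embedded(const char *name, size_t *len)
{
    assert(name && len);
    for (const struct charset_entry *e = charset_table; e->name; e++) {
        if (strcmp(e->name, name) == 0) {
            *len = e->length;
            return charset_blob + e->offset;
        }
    }
    return NULL;
}
```

Both arrays are plain constant data, so they can sit in libc.a or in
an extra .o file without any loader support.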
Embedding may be done in two steps:

1) Create a selection file for the musl build, and include the
   specified charsets in libc.a/.so
2) Select the required charset files and create a .o file to link
   into your program.

iconv shall then:

- look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
- search the table of charsets linked with libc
- search the table of charsets linked with the program
- search for the charset on the external search path

... or go in the opposite direction and use the first charset
conversion found. This lookup is usually very small, except for
the file system search, so it shall not produce much overhead /
bloat.

[Addendum after thinking a bit more: The byte code conversion
files shall consist of a small static header, followed by the
byte code program. The header shall contain the charset name, the
version of the required virtual machine and the length of the
byte code. So you only need to append all such conversion files
to a big array of bytes and add a null header to mark the end of
the table. Then you only need the start of the array and you are
able to search through it for a specific charset. The iconv
function in libc contains a definition for an "unsigned char const
*iconv_user_charsets = NULL;", which is linked in when the user
does not provide their own definition. So iconv can search all
linked-in charset definitions, and needs no code changes. Really
simple configuration to select the charsets to build in.]

> if the format changes then dynamic linking is
> problematic as well: you cannot update libc
> in a single atomic operation

The byte code shall be independent of dynamic linking. The
conversion files are only streams of bytes, which shall also be
architecture independent. So you only need to update the
conversion files if the virtual machine definition of iconv has
changed (which shall not happen often). External files may be read
into malloc-ed buffers or mmap-ed, not linked in by the dynamic
linker.

--
Harald
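
The header chain from the addendum could be searched roughly as in
the C sketch below. The message does not fix a header layout, so the
field sizes (16 byte name, 1 byte VM version, 2 byte big-endian
length), the constants, the function name find_charset and the
"X-DEMO-*" entries are assumptions for illustration only. In this
sketch the program supplies iconv_user_charsets itself; in the
proposal libc would carry the NULL default.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed header layout (not specified in the message):
 *   bytes  0..15  charset name, NUL padded
 *   byte   16     required virtual machine version
 *   bytes 17..18  byte code length, big-endian
 * A header whose name starts with a NUL byte ends the table. */
enum { NAME_LEN = 16, HDR_LEN = 19, VM_VERSION = 1 };

/* Walk the chained headers starting at `tab`. Return the byte code
 * for `name` and store its length in *len, or return NULL. */
const unsigned char *find_charset(const unsigned char *tab,
                                  const char *name, size_t *len)
{
    assert(name && len);
    if (!tab || strlen(name) > NAME_LEN)
        return NULL;                    /* no table, or name too long */
    while (tab[0] != 0) {
        size_t n = ((size_t)tab[17] << 8) | tab[18];
        if (tab[16] <= VM_VERSION &&
            strncmp((const char *)tab, name, NAME_LEN) == 0) {
            *len = n;
            return tab + HDR_LEN;
        }
        tab += HDR_LEN + n;             /* skip header and byte code */
    }
    return NULL;
}

/* Two placeholder conversion files followed by a null header. */
static const unsigned char demo_charsets[] = {
    'X','-','D','E','M','O','-','A',0,0,0,0,0,0,0,0, 1, 0,3,
    0x01, 0x02, 0x03,
    'X','-','D','E','M','O','-','B',0,0,0,0,0,0,0,0, 1, 0,2,
    0x0a, 0x0b,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0, 0,0
};

/* The program's own definition of the pointer named in the message. */
const unsigned char *iconv_user_charsets = demo_charsets;
```

Because the length is read byte by byte, the same table works on
machines of either endianness, which matches the goal of
architecture independent conversion files.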