Date: Mon, 5 Aug 2013 09:03:43 +0200
From: Harald Becker <ralda@....de>
Cc: musl@...ts.openwall.com, nsz@...t70.net
Subject: Re: iconv Korean and Traditional Chinese research so far

Hi !

05-08-2013 05:13 Szabolcs Nagy <nsz@...t70.net>:

> * Harald Becker <ralda@....de> [2013-08-05 03:24:52 +0200]:
> > iconv then shall:
> > - look for some fixed charsets like ASCII, Latin-1, UTF-8,
> >   etc.
> > - search table of with libc linked charsets
> > - search table of with the program linked charsets
> > - search for charset on external search path
>
> sounds like a lot of extra management cost
> (for libc, application writer and user as well)

This is not so much work. You already need to search for the
character set table to use, that is, you need to search at least a
table of string values to find the pointer to the conversion table.
Searching a table, in my statement above, means just walking a
pointer chain and doing string compares to find a matching character
set. Not much difference from the code that is required anyway.
Doing this a second time to check a possible user chain is just one
more helper function call.

The only code that gets a bit bigger is the file system search. This
depends on whether we only try a single location or walk through a
search path list. But this is the cost of the flexibility to
dynamically load character set conversions (which I would really
prefer for seldom used char sets).

... and for the application writer it is only more work if he wants
to add some charset tables to his program which are not in the
static libc.

The problem is, all tables in libc need to be linked into your
program if you include iconv. So each added charset conversion
increases the size of your program ... and I definitely won't include
Japanese, Chinese or Korean charsets in my program. Not that I ignore
those people's needs, I just won't need them, so I don't like to add
those conversions to programs sitting on my disk.
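[Illustration: the two-chain lookup described above might look roughly
like the following. This is only a sketch under assumptions: the
struct layout and the names chain_find and charset_lookup are
hypothetical, the fixed charsets and the file system search are left
out, and names are compared case-sensitively for brevity.]

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical node type: one entry in a linked chain of charsets. */
struct charset {
    const char *name;            /* charset name, e.g. "KOI8-R" */
    const unsigned char *table;  /* conversion data, format left open */
    const struct charset *next;  /* next entry, NULL at end of chain */
};

/* Walk one chain doing string compares; return the match or NULL. */
static const struct charset *chain_find(const struct charset *c,
                                        const char *name)
{
    for (; c; c = c->next)
        if (strcmp(c->name, name) == 0)
            return c;
    return NULL;
}

/* Try the chain linked with libc first, then the chain supplied by
 * the program. Checking the second chain costs one more call to the
 * same helper. Either chain may be NULL (empty). */
const struct charset *charset_lookup(const struct charset *libc_chain,
                                     const struct charset *user_chain,
                                     const char *name)
{
    const struct charset *c = chain_find(libc_chain, name);
    return c ? c : chain_find(user_chain, name);
}
```

[A caller would fall back to the external search path only when
charset_lookup returns NULL.]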
> it would be nice if the compiler could figure out
> at build time (eg with lto) which tables are used
> but i guess charsets often only known at runtime

How do you want to do this? And how shall the compiler know which
char sets the user may use during operation? So the only way to
select the charset tables to include in your program is to assume
ahead which tables might be used. That is part of the configuration
of the musl build or of the application program build.

> > [Addendum after thinking a bit more: The byte code conversion
> > files shall exist of a small statical header, followed by the
> > byte code program. The header shall contain the charset name,
> > version of required virtual machine and length of byte code.
> > So you need only add all such conversion files to a big array
> > of bytes and add a Null header to mark the end of table. Then
> > you only need the start of the array and you are able to
> > search through for a specific charset. The iconv function in
> > libc contains a definition for an "unsigned char const
> > *iconv_user_charsets = NULL;", which is linked in, when the
> > user does not provide its own definition. So iconv can
> > search all linked in charset definitions, and need no code
> > changes. Really simple configuration to select charsets to
> > build in.]
> >
>
> yes that can work, but it's a musl specific hack
> that the application programmer need to take care of

Some extra work has to be done only if the application programmer
wants to add a char set which is not in libc to the statically built
program. That gives some more flexibility. If you don't care, you
get the list of char sets built into musl.

> > > if the format changes then dynamic linking is
> > > problematic as well: you cannot update libc
> > > in a single atomic operation
> >
> > The byte code shall be independent of dynamic linking. The
> > conversion files are only streams of bytes, which shall also
> > be architecture independent.
> > So you do only need to update the
> > conversion files if the virtual machine definition of iconv
> > has been changed (shall not be done much). External files may
> > be read into malloc-ed buffers or mmap-ed, not linked in by
> > the dynamical linker.
> >
>
> that does not solve the format change problem
> you cannot update libc without race
> (unless you first replace the .so which supports
> the old format as well as the new one, but then
> libc has to support all previous formats)

If the definition of the iconv virtual state machine is modified,
you need to take extra care on update (delete old charset files,
install new lib, install new charset files, restart system) ... but
this is only required on a major update. As soon as the virtual
machine definition has stabilized you do not need to change charset
definition files; you just update your lib, then update any new
charset files. After an initial phase of testing it shall happen
relatively seldom that the virtual machine definition needs to be
changed in an incompatible manner. And simply extending the virtual
machine does not invalidate the old charset files.

> it's probably easy to design a fixed format to
> avoid this

A fixed format? For what? Do you know the differences between char
sets, especially multi byte char sets?

> it seems somewhat similar to the timezone problem
> except zoneinfo is maintained outside of libc so
> there is not much choice, but there are the same
> issues: updating it should be done carefully,
> setuid programs must be handled specially etc

Again. As soon as the virtual machine definition has reached a
stable state, it shall rarely happen that a change invalidates a
charset definition file. That is, at least old files will continue
to work with newer lib versions. So there is no problem on update,
just update your lib, then update your charset files. The only
problem will be if a still running application uses a new charset
file with an old version of the lib.
This will be detected and leads to a failure code from iconv. So you
need to restart your application ... which is always a good decision
after you have updated your lib.

--
Harald
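
[Illustration: the header scheme from the addendum quoted earlier
(charset name, required virtual machine version, byte code length,
entries packed into one byte array with a null header at the end)
could be searched roughly as below. Everything concrete here is an
assumption made for the sketch: the byte layout, the one-byte end
marker, the names VM_VERSION and blob_find, and the return codes.
The version check shows how a charset file that needs a newer
virtual machine than the running lib provides can be detected and
turned into a failure code.]

```c
#include <stddef.h>
#include <string.h>

#define VM_VERSION 1  /* hypothetical: newest VM version this lib runs */

enum { CS_OK = 0, CS_NOT_FOUND = -1, CS_TOO_NEW = -2 };

/* Assumed entry layout (not taken from the post):
 *   byte 0      name length n; 0 marks the end of the table
 *   byte 1      required VM version
 *   bytes 2-3   byte code length, big-endian (architecture independent)
 *   n bytes     charset name, no terminator
 *   then        the byte code itself
 */
int blob_find(const unsigned char *p, const char *name,
              const unsigned char **code, size_t *code_len)
{
    size_t want = strlen(name);

    if (!p)
        return CS_NOT_FOUND;            /* no table linked in */

    while (p[0] != 0) {
        size_t n = p[0];
        size_t len = ((size_t)p[2] << 8) | p[3];

        if (n == want && memcmp(p + 4, name, n) == 0) {
            if (p[1] > VM_VERSION)
                return CS_TOO_NEW;      /* file needs a newer lib */
            *code = p + 4 + n;
            *code_len = len;
            return CS_OK;
        }
        p += 4 + n + len;               /* skip header, name, byte code */
    }
    return CS_NOT_FOUND;
}
```

[Only a pointer to the start of the array is needed, so the same
function could scan a table linked into the program or a file read
into a buffer.]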