Date: Tue, 22 Jul 2014 14:49:32 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Locale bikeshed time

I've got the next phase of the locale work pretty much ready to commit,
but since it needs some policy for how to load locales, I want to
continue the discussion first rather than having commits that change
the behavior back and forth as we discuss this.

Overall, my plan at this point is to disallow any absolute/relative
pathnames in the LC_* vars and restrict them purely to locale names,
and have the path in a separate variable outside the scope of the
standard. This is basically how glibc does it, and the idea is that you
can allow locale names from an untrusted source (e.g. for suid, for
remote apps acting on behalf of a user such as web apps or gitolite, or
for apps that process mixed-locale data with uselocale and have locale
names in their data) as long as the locale path does not contain
malicious locales.

So, the first bikeshed decision to be made is what environment variable
to use for the locale path, and what the fallback should be if it's not
set. Glibc uses $LOCPATH. On the one hand it would be nice to use the
same var (since apps are already aware of the need to treat it
specially), but on the other it's undesirable to have them tied
together (e.g. if you're using musl as a non-root installation and
can't write to /usr/lib), and to avoid clashing with glibc's files we
would need to choose a subdirectory under $LOCPATH rather than using it
directly. All of these aspects make it a lot less attractive.

The second issue is how locale categories are split up. Glibc has each
category in a separate file, except for the "locale-archive" file which
stores everything in one file for easy mapping. My leaning so far is to
put the whole locale -- time format and translations, message
translations, ... -- in a single file. This avoids the need for
multiple mappings (and syscall overhead, and vma overhead, ...) if
you're using the same value for all categories. But on the other hand,
if you wanted to have lots of subtle variants of a locale, you might
end up with largely-duplicate files on disk. Fortunately I think
they'll all be very small anyway so this may not matter.

Of course making this work is contingent on finding a good way to
encode LC_MONETARY and LC_COLLATE data in a .mo file, since if the
whole locale is unified into one file, it would be a .mo file. My
leaning is to simply use "int_cur_symbol", etc. as gettext keys for the
string fields of LC_MONETARY and then put all the numeric fields of
lconv into a single string that could be parsed with scanf or a tiny
integer parser in localeconv() on the first usage. While not the most
efficient, it avoids needing nasty special tools to generate locale
files; a po-to-mo converter is all you need.

For LC_COLLATE, obviously one solution would be to have keys for each
collation element and use gettext to convert collation elements to the
symbols strxfrm is supposed to output. I'm not sure if the efficiency
of this method is tolerable, however. We could go with it for now and
later add something more advanced if needed (e.g. mapping to a DFA
represented as a byte array that does the conversions).

I probably have some more issues to discuss with this too, but I'll
just go ahead and send now to get discussion started, and hopefully get
back to adding some more code first.

Rich
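
[Editorial sketch, not part of the original message.] The name-only
policy for LC_* values described above could look roughly like the
following in C. This is a minimal sketch under stated assumptions, not
musl's actual code: the helper names locale_name_ok and
locale_file_path, the variable name EXAMPLE_LOCPATH, and the choice to
reject names starting with '.' are all hypothetical, since the thread
has not yet settled on a variable name or on exact validation rules.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name; the thread has not chosen one. */
#define EXAMPLE_LOCPATH_VAR "EXAMPLE_LOCPATH"

/* Accept only plain locale names: non-empty, no '/' anywhere, and no
 * leading '.', so ".", ".." and absolute/relative paths are rejected. */
int locale_name_ok(const char *name)
{
    if (!name || !*name) return 0;
    if (strchr(name, '/')) return 0;
    if (name[0] == '.') return 0;
    return 1;
}

/* Build "<dir>/<name>" into buf. Returns 0 on success, or -1 if the
 * directory is missing, the name is rejected, or the result would not
 * fit in buf. */
int locale_file_path(char *buf, size_t size,
                     const char *dir, const char *name)
{
    int n;

    if (!dir || !*dir || !locale_name_ok(name)) return -1;
    n = snprintf(buf, size, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= size) return -1;
    return 0;
}

/* Resolve a locale name against the directory named by the
 * (hypothetical) environment variable. No fallback directory is
 * assumed here, because the fallback is still an open question. */
int locale_file_from_env(char *buf, size_t size, const char *name)
{
    return locale_file_path(buf, size, getenv(EXAMPLE_LOCPATH_VAR), name);
}
```

With these helpers, a name such as "en_US.UTF-8" passes, while
"../evil" or "/etc/passwd" is refused before any file is opened, so
only the separately configured directory decides which files can be
mapped.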