Date: Mon, 08 Jun 2015 04:46:42 +0200
From: Harald Becker <ralda@....de>
To: musl@...ts.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale, draft 1]

On 08.06.2015 02:33, Rich Felker wrote:
> On Mon, Jun 08, 2015 at 01:59:35AM +0200, Harald Becker wrote:
>> On 07.06.2015 02:24, Rich Felker wrote:
>>> It's somewhat more clear what you're talking about, but I'm still not
>>> sure what specific pieces of code you would want to omit from libc.so.
>>> Which of the following would you want to remove or keep?
>>
>> I did not look into all the details ...
>>
>> In general: Keep the API, but add stubs with minimal operation or
>> fail for none C locale (etc.).
>>
>>> - UTF-8 encoding and decoding
>>
>> May be of use to keep, if on bare minimum.
>
> This is roughly 3k of code, and is mandatory if you want to say you
> "support UTF-8" at all. I'll note the other parts that fundamentally
> depend on it.

3k? Which functions do you count in this? ... and I don't see why it is
so mandatory for a pure C locale. UTF-8 shall only pass through and not
break the base operation.

>>> - Character properties
>>> - Case mappings
>>
>> Keep ASCII, map all none ASCII to a single value.
>
> I assume by "map to a single value" you mean uniform properties for
> all non-ASCII Unicode characters, e.g. just printable but nothing
> else. Case-mapping everything down to one character would not be a
> good idea. :-)

I mean that any non-ASCII character is just classified as "NOT ASCII"
(or maybe "UTF-8"). Don't do any case mapping for non-"C" locales, etc.

> Character properties are roughly 11k of code. Case mappings are 1k of
> code.

When narrowing things to pure ASCII / C locale, is there still no
chance to cut this down?

> Note that while some of the properties are arguably not very useful
> (the wctype system does not give you enough information to do serious
> text processing with them), without the wcwidth property, you cannot
> properly display non-ASCII text on a terminal.
> So at least this one, which takes 3k, is pretty critical to "UTF-8 support".

There are not so many applications which require full text processing.
The resulting lib shall work correctly for any text in the base C
locale, but shall allow embedding UTF-8 sequences. If an application
needs to handle those sequences specially, that has to be done in the
application.

Again: This is not about building a lib for a general purpose desktop
system. It is for an optional stripped-down version.

>>> - Internal message translation (nl_langinfo strings, errors, etc.)
>>> - Message translation API (gettext)
>>
>> No translation at all, keep the English messages (as short as possible).
>
> The internal translation support is about 2k. The gettext system is
> roughly another 2k on top of that (and depends on the former).

Strip down to nearly nothing. Return the key string as the result of
translation, just as if there is no translation available.

> iconv is big. About 128k. The ability to selectively omit some or all
> legacy charsets from iconv is a long-term goal.

This is why I would like to cut that down. By dropping everything
except ASCII / C locale support and maybe a base set of UTF-8
operations, it should be possible to do heavy optimization.

> Of course if you have an actual need for character set conversion,
> e.g. reading email in mutt, then your alternative to musl's 128k iconv
> is GNU libiconv weighing in at several MB...

If you really have a need for conversion, this option is not for you,
or you need to do the conversion in the application ... there are not
so many applications which require such full and flexible character set
conversions, and even those work with the stripped-down version as long
as you stay with pure ASCII text.

For many dedicated and small systems, pure ASCII operation may be all
that is required (e.g. emulator sets, containers, etc.).

> For both fnmatch and regex, the single-character-match (? or .
> respectively) matches characters, not bytes.
> Likewise bracket expressions match characters. In order for this to
> work at all, you need UTF-8 decoding (see above).

No! When optimizing for pure UTF-8 support, you can do clever
optimizations, eliminating that back-and-forth UTF-8 character
conversion completely. That is what I would like to get. E.g. it is
simple to match characters, not bytes, without decoding the UTF-8
sequences, by just skipping the extension bytes (0x80..0xBF) where
required.

> There's no directly measurable code size cost for these items; the
> savings from not doing UTF-8 would come from completely different code
> that doesn't now exist in musl for bypassing mbtowc and just working
> directly on input bytes.

That is what I would like to get: just working on the stream of input
bytes, and leaving UTF-8 as a sequence of bytes in the string. Many
applications don't care about those sequences anyway, and for other
applications they can be handled correctly with some clever
programming.

> So aside from iconv, the above seem to total around 19k, and at least
> 6k of that is mandatory if you want to be able to claim to support
> UTF-8. So the topic at hand seems to be whether you can save <13k of
> libc.so size by hacking out character handling/locale related features
> that are non-essential to basic UTF-8 support...

I would like to get a stripped-down version which eliminates all the
character set handling code that is unnecessary in dedicated systems,
but stripping that out on every release is too much work.

The benefit may be for:

- embedded systems
- small initramfs based systems
- container systems
- minimal chroot environments

It is not intended as a lib for general purpose desktop systems, but
musl has reached a state where its standards compliance puts it ahead
of other small libs. The caveat is that every new piece of
functionality makes it completely into the shared library.
So I'm looking for a way to drop all that character set and locale
handling stuff without losing API compatibility for base C locale
operation, while passing through, or doing some clever handling of,
UTF-8 sequences.

--
Harald
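
A minimal sketch of the byte-skipping idea mentioned above (matching
characters, not bytes, without decoding, by skipping bytes in the range
0x80..0xBF). It assumes the input is valid UTF-8 and does no validation
at all. The helper names is_cont, skip_char and count_chars are
hypothetical and are not part of musl:

```c
#include <stddef.h>

/* Hypothetical helper: nonzero for UTF-8 continuation ("extension")
 * bytes, i.e. bytes in the range 0x80..0xBF (bit pattern 10xxxxxx). */
static int is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical helper: step past exactly one character without decoding
 * it. Consumes one lead (or ASCII) byte, then any continuation bytes
 * that follow. Stays put at the terminating NUL. Assumes valid UTF-8. */
static const char *skip_char(const char *s)
{
    if (*s)
        s++;
    while (is_cont((unsigned char)*s))
        s++;
    return s;
}

/* Hypothetical helper: count characters in a NUL-terminated string by
 * counting every byte that is not a continuation byte. */
static size_t count_chars(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (!is_cont((unsigned char)*s))
            n++;
    return n;
}
```

For the 7-byte string "a\xC3\xA4\xE2\x82\xAC" "b" (a, U+00E4, U+20AC, b),
count_chars returns 4, and a matcher for "?" or "." could call skip_char
once per matched character. Bracket expressions with non-ASCII members
would still need to compare whole byte sequences.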