musl - Re: Build option to disable locale [was: Byte-based C locale, draft 1]

Message-ID: <55750212.5090304@gmx.de>
Date: Mon, 08 Jun 2015 04:46:42 +0200
From: Harald Becker <ralda@....de>
To: musl@...ts.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale,
 draft 1]

On 08.06.2015 02:33, Rich Felker wrote:
> On Mon, Jun 08, 2015 at 01:59:35AM +0200, Harald Becker wrote:
>> On 07.06.2015 02:24, Rich Felker wrote:
>>> It's somewhat more clear what you're talking about, but I'm still not
>>> sure what specific pieces of code you would want to omit from libc.so.
>>> Which of the following would you want to remove or keep?
>>
>> I did not look into all the details ...
>>
>> In general: Keep the API, but add stubs with minimal operation or
>> fail for non-C locales (etc.).
>>
>>> - UTF-8 encoding and decoding
>>
>> May be useful to keep, if reduced to the bare minimum.
>
> This is roughly 3k of code, and is mandatory if you want to say you
> "support UTF-8" at all. I'll note the other parts that fundamentally
> depend on it.

3k? Which functions do you count toward this? Also, I don't see why it
is mandatory for a pure C locale. UTF-8 should simply pass through
without breaking the base operation.


>>> - Character properties
>>> - Case mappings
>>
>> Keep ASCII, map all non-ASCII to a single value.
>
> I assume by "map to a single value" you mean uniform properties for
> all non-ASCII Unicode characters, e.g. just printable but nothing
> else. Case-mapping everything down to one character would not be a
> good idea. :-)

I mean: classify any non-ASCII character simply as "NOT ASCII" (or
maybe "UTF-8"). Don't do any case mapping for a non-"C" locale, etc.
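
A rough sketch of what such ASCII-only stubs might look like. The
stub_ names are hypothetical and are not musl symbols; treating every
non-ASCII code point as uniformly "printable" is an assumption taken
from the reading suggested in the quoted text above.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical ASCII-only stand-ins for iswalpha/iswprint/towupper.
 * Assumption: every non-ASCII code point gets the same properties
 * (printable, nothing else) and is never case-mapped. */

static int in_range(wint_t wc, wint_t lo, wint_t hi)
{
    return wc >= lo && wc <= hi;
}

int stub_iswalpha(wint_t wc)
{
    /* Only ASCII letters are alphabetic. */
    return in_range(wc, 'a', 'z') || in_range(wc, 'A', 'Z');
}

int stub_iswprint(wint_t wc)
{
    if (wc == WEOF)
        return 0;
    if (wc < 0x80)
        return in_range(wc, 0x20, 0x7e);
    return 1; /* "NOT ASCII": uniformly printable */
}

wint_t stub_towupper(wint_t wc)
{
    /* Case mapping for ASCII only; everything else passes through. */
    return in_range(wc, 'a', 'z') ? wc - ('a' - 'A') : wc;
}

/* Small sanity check that doubles as a usage example. */
void stub_wctype_selfcheck(void)
{
    assert(stub_iswalpha('q'));
    assert(!stub_iswalpha(0x00E4));           /* U+00E4: not alphabetic here */
    assert(stub_iswprint(0x00E4));            /* but still printable */
    assert(stub_towupper('q') == (wint_t)'Q');
    assert(stub_towupper(0x00E4) == (wint_t)0x00E4);
}
```

Stubs like these keep the API in place but carry no property or
case-mapping tables. They also give no width information for non-ASCII
characters, which is the cost pointed out in the quoted reply below.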

> Character properties are roughly 11k of code. Case mappings are 1k of
> code.

When narrowing things to pure ASCII / C locale, is there still no
chance to cut this down?

> Note that while some of the properties are arguably not very useful
> (the wctype system does not give you enough information to do serious
> text processing with them), without the wcwidth property, you cannot
> properly display non-ASCII text on a terminal. So at least this one,
> which takes 3k, is pretty critical to "UTF-8 support".

Not many applications require full text processing. The resulting lib
should work correctly for any text in the base C locale, but should
allow UTF-8 sequences to be embedded. If an application needs to handle
those sequences specially, that has to be done in the application.

Again: this is not meant to build a lib for a general-purpose desktop
system. It is for an optional stripped-down version.

>>> - Internal message translation (nl_langinfo strings, errors, etc.)
>>> - Message translation API (gettext)
>>
>> No translation at all, keep the English messages (as short as possible).
>
> The internal translation support is about 2k. The gettext system is
> roughly another 2k on top of that (and depends on the former).

Strip down to nearly nothing. Return the key string as the result of
translation, just as if no translation were available.
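
A minimal sketch of such pass-through stubs, assuming the usual gettext
signatures. The stub_ prefix is hypothetical and only used here to
avoid clashing with the real functions.

```c
#include <assert.h>

/* Hypothetical pass-through stubs: behave as if no message catalog
 * exists, so the key (msgid) itself is returned as the "translation". */

char *stub_gettext(const char *msgid)
{
    assert(msgid != 0);
    return (char *)msgid;
}

char *stub_dgettext(const char *domain, const char *msgid)
{
    (void)domain; /* the text domain is ignored entirely */
    return stub_gettext(msgid);
}

char *stub_ngettext(const char *msgid1, const char *msgid2, unsigned long n)
{
    assert(msgid1 != 0 && msgid2 != 0);
    /* Untranslated fallback rule: singular only for n == 1. */
    return (char *)(n == 1 ? msgid1 : msgid2);
}
```

A caller written against the gettext API keeps working unchanged and
simply sees the untranslated English strings.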

> iconv is big. About 128k. The ability to selectively omit some or all
> legacy charsets from iconv is a long-term goal.

This is why I would like to cut that down. By dropping everything
except ASCII / C locale support, and maybe a base set of UTF-8
operations, it should be possible to do heavy optimization.

> Of course if you have an actual need for character set conversion,
> e.g. reading email in mutt, then your alternative to musl's 128k iconv
> is GNU libiconv weighing in at several MB...

If you really need conversion, this option is not for you, or you need
to do the conversion in the application. Not many applications require
such full and flexible character set conversions, and even those work
with the stripped-down version as long as you stay with pure ASCII
text. For many dedicated and small systems, pure ASCII operation may be
all that is required (e.g. emulator sets, containers, etc.).

> For both fnmatch and regex, the single-character-match (? or .
> respectively) matches characters, not bytes. Likewise bracket
> expressions match characters. In order for this to work at all, you
> need UTF-8 decoding (see above).

No! When optimizing for pure UTF-8 support, you can do clever
optimizations that eliminate the back-and-forth UTF-8 character
conversion completely. That is what I would like to get.

E.g., it is simple to match characters rather than bytes without
decoding the UTF-8 sequences, just by skipping the continuation bytes
(0x80..0xBF) where required.
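
A rough sketch of that idea, assuming the input is already valid UTF-8
(nothing here validates it). The helper names are hypothetical, and the
matcher handles only the '?' wildcard, not the full fnmatch syntax.

```c
#include <assert.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (0x80..0xBF). */
static int is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Step over one character without decoding it: one lead byte plus any
 * continuation bytes that follow. Assumes valid UTF-8. */
const char *utf8_next(const char *s)
{
    assert(s != NULL);
    if (*s == '\0')
        return s;
    s++;
    while (is_cont((unsigned char)*s))
        s++;
    return s;
}

/* Count characters by counting the bytes that are not continuations. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s; s++)
        if (!is_cont((unsigned char)*s))
            n++;
    return n;
}

/* Tiny matcher: '?' consumes one whole character, every other pattern
 * byte must equal the corresponding input byte. */
int match_qmark(const char *pat, const char *str)
{
    assert(pat != NULL && str != NULL);
    while (*pat) {
        if (*pat == '?') {
            if (*str == '\0')
                return 0;
            str = utf8_next(str);
            pat++;
        } else {
            if (*pat != *str)
                return 0;
            pat++;
            str++;
        }
    }
    return *str == '\0';
}
```

With this, the pattern "a?z" matches the four-byte string
"a\xC3\xA4z", because the two bytes of U+00E4 are skipped as a single
character. Bracket expressions and invalid sequences would need more
than this sketch covers.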

> There's no directly measurable code size cost for these items; the
> savings from not doing UTF-8 would come from completely different code
> that doesn't now exist in musl for bypassing mbtowc and just working
> directly on input bytes.

That is what I would like to get: just working on the stream of input
bytes and leaving UTF-8 as a sequence of bytes in the string. Many
applications don't care about those sequences anyway, and in other
applications they can be handled correctly with some clever
programming.

> So aside from iconv, the above seem to total around 19k, and at least
> 6k of that is mandatory if you want to be able to claim to support
> UTF-8. So the topic at hand seems to be whether you can save <13k of
> libc.so size by hacking out character handling/locale related features
> that are non-essential to basic UTF-8 support...

I would like to get a stripped-down version which eliminates all the
char set handling code that is unnecessary on dedicated systems, but
stripping it out by hand on every release is too much work.

The benefit may be for:

- embedded systems
- small initramfs based systems
- container systems
- minimal chroot environments

It is not intended as a lib for general-purpose desktop systems, but
musl has reached a state where its standards compliance puts it ahead
of other small libs. The caveat is that every new piece of
functionality goes completely into the shared library. So I'm looking
for a way to drop all that char set and locale handling stuff without
losing API compatibility for base C locale operation, while passing
UTF-8 sequences through or handling them with some clever code.

--
Harald