musl - Re: Build option to disable locale [was: Byte-based C locale, draft 1]

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150608040629.GF17573@brightrain.aerifal.cx>
Date: Mon, 8 Jun 2015 00:06:29 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale,
 draft 1]

On Mon, Jun 08, 2015 at 04:46:42AM +0200, Harald Becker wrote:
> On 08.06.2015 02:33, Rich Felker wrote:
> >On Mon, Jun 08, 2015 at 01:59:35AM +0200, Harald Becker wrote:
> >>On 07.06.2015 02:24, Rich Felker wrote:
> >>>It's somewhat more clear what you're talking about, but I'm still not
> >>>sure what specific pieces of code you would want to omit from libc.so.
> >>>Which of the following would you want to remove or keep?
> >>
> >>I did not look into all the details ...
> >>
> >>In general: Keep the API, but add stubs with minimal operation or
> >>fail for none C locale (etc.).
> >>
> >>>- UTF-8 encoding and decoding
> >>
> >>May be of use to keep, if on bare minimum.
> >
> >This is roughly 3k of code, and is mandatory if you want to say you
> >"support UTF-8" at all. I'll note the other parts that fundamentally
> >depend on it.
> 
> 3k ? Which functions do you add to this? ... and I don't see it is
> so mandatory for pure C locale. UTF-8 shall only pass through and
> not break the base operation.

Some of that is just the sheer number of interfaces mandated by the
language; each one is going to be at least 50 bytes or so even if it
basically does nothing. But most of it is about having non-pitiful
performance for string ops. The core UTF-8 encode/decode is under 800
bytes total and could be reduced to around 500 at a moderate
performance cost.

> >>>- Character properties
> >>>- Case mappings
> >>
> >>Keep ASCII, map all none ASCII to a single value.
> >
> >I assume by "map to a single value" you mean uniform properties for
> >all non-ASCII Unicode characters, e.g. just printable but nothing
> >else. Case-mapping everything down to one character would not be a
> >good idea. :-)
> 
> I mean any none ASCII say "NOT ASCII" (or may be "UTF-8"). Don't do
> any case mapping for none "C" locale, etc.

Whatever you do, case mappings are one of the cheapest parts, at only
1k. I don't see any benefit to be had here. You'd probably spend that
much in _code_ size complexifying the code to deal with the
possibility of variable case-mapping state.

> >Character properties are roughly 11k of code. Case mappings are 1k of
> >code.
> 
> When narrowing things to pure ASCII / C locale, are there still no
> chance to cut this down?

In that case you have a broken system that does not support UTF-8.
This is not something musl will support. One of the core principles of
musl is that there are not two tiers of characters/languages.
Everything is UTF-8 and all characters are usable anywhere.

> >Note that while some of the properties are arguably not very useful
> >(the wctype system does not give you enough information to do serious
> >text processing with them), without the wcwidth property, you cannot
> >properly display non-ASCII text on a terminal. So at least this one,
> >which takes 3k, is pretty critical to "UTF-8 support".
> 
> There are not so many applications which require full text
> processing. And the resulting lib shall work correct for any text in
> the base C locale, but shall allow to embed UTF-8 sequences. If an
> application needs to handle those sequences special it has to be
> done in the application.
> 
> Again: This not to build a lib for a general purpose desktop system.
> It is for an optional stripped down version.

For what purpose? A size reduction of 1-2% is not going to make the
difference between libc.so fitting somebody's needs or not fitting
them. And musl is pretty close to being as small as you can make a
full POSIX libc without resorting to abysmal performance (e.g.
quadratic qsort and strstr, all byte-at-a-time string functions,
unbuffered stdio, etc.) so I don't think there's even a theoretical
"these savings might be more than 2% if other parts of libc were more
size-optimized".

> >>>- Internal message translation (nl_langinfo strings, errors, etc.)
> >>>- Message translation API (gettext)
> >>
> >>No translation at all, keep the English messages (as short as possible).
> >
> >The internal translation support is about 2k. The gettext system is
> >roughly another 2k on top of that (and depends on the former).
> 
> Strip down to nearly nothing. Return the key string as result of
> translation, just as if there is no translation available.

Yes, that's doable but not very beneficial.

> >iconv is big. About 128k. The ability to selectively omit some or all
> >legacy charsets from iconv is a long-term goal.
> 
> This is why I like to cut that down. With dropping everything except
> ASCII / C locale support and may be a base set of UTF-8 operation,
> it should be possible to do heavy optimization.

This is the one item in the list that is a strong candidate for
configurability. It's very large (roughly 25% the size of libc.so) and
omitting it has no impact on C/POSIX correctness or on first-class
status of all characters, It's just optional, implementation-defined
functionality (albeit very useful functionality).

> >Of course if you have an actual need for character set conversion,
> >e.g. reading email in mutt, then your alternative to musl's 128k iconv
> >is GNU libiconv weighing in at several MB...
> 
> If you really have a need for conversion, this option is not for
> you, or you need to do the conversion in the application ... there
> are not so many applications which require such full and flexible
> character set conversions, and even those work with the stripped
> down version as long as you stay at pure ASCII text. For many
> dedicated and small systems a pure ASCII operation may be all
> required (e.g. emulator sets, container, etc.).
> 
> >For both fnmatch and regex, the single-character-match (? or .
> >respectively) matches characters, not bytes. Likewise bracket
> >expressions match characters. In order for this to work at all, you
> >need UTF-8 decoding (see above).
> 
> No! When optimizing for pure UTF-8 support, you can do clever
> optimizations, eliminating that for and back of UTF-8 character
> conversion completely. That is what I like to get.

I would like to do UTF-8-specific optimizations in regex, but what
they buy you is performance, not code size. The code is moderately
larger because generating a byte-based NFA for matching dot, bracket,
etc. of UTF-8 characters is much more complex than a character-based
NFA.

> e.g. it is simple to match characters not bytes without decoding the
> UTF-8 sequences by just skipping the extension bytes (0x80..0xBF)
> where required.

No, that doesn't work because it also matches junk, including illegal
sequences which are highly unsafe to match. For a UTF-8-specific regex
implementation that works in bytes, "." has to compile as the pattern
represented by the ABNF in RFC 3629.

> >So aside from iconv, the above seem to total around 19k, and at least
> >6k of that is mandatory if you want to be able to claim to support
> >UTF-8. So the topic at hand seems to be whether you can save <13k of
> >libc.so size by hacking out character handling/locale related features
> >that are non-essential to basic UTF-8 support...
> 
> I like to get a stripped down version, which eliminate all the
> unnecessary char set handling code used in dedicated systems, but
> stripping that on every release is too much work to do.
> 
> The benefit may be for:
> 
> - embedded systems
> - small initramfs based systems
> - container systems
> - minimal chroot environments

"May be"? I would like at least one citation for an instance where
stripping 1-2% of the size from libc.so makes a difference for any of
these. As stated before I'm open to (and in fact aiming for, in the
long term) making it possible to reduce iconv coverage, but I haven't
seen anything remotely convincing as an argument to break things all
over libc and break the property that the entire UTF-8 character space
has first-class status for the sake of saving a couple kb.

> It's intention is not as a lib for general purpose desktop systems,
> but musl has reached a state where it's standard compliance boosts
> it over other small libs. The caveat is every new functionality make
> it completely into the shared library. So I'm looking for a
> possibility to drop all that char set and locale handling stuff,
> without losing API compatibility for base C locale operation, and
> passing through or doing some clever handling on UTF-8 sequences.

Doing UTF-8 right is hard because it's a balance of lots of
constraints. Most people get them wrong. Pretending UTF-8 is just
another 8-bit encoding does not work for the vast majority of cases,
and while there certainly ARE cases where that is possible, and there
ARE people who are capable of finding them and taking advantage of
them, if nothing else this thread has served to demonstrate that
strong desire to take these shortcuts is not highly correlated with
being qualified to know when they can be taken...

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.