musl - Re: Build option to disable locale [was: Byte-based C locale, draft 1]

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150608003315.GD17573@brightrain.aerifal.cx>
Date: Sun, 7 Jun 2015 20:33:15 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale,
 draft 1]

On Mon, Jun 08, 2015 at 01:59:35AM +0200, Harald Becker wrote:
> On 07.06.2015 02:24, Rich Felker wrote:
> >It's somewhat more clear what you're talking about, but I'm still not
> >sure what specific pieces of code you would want to omit from libc.so.
> >Which of the following would you want to remove or keep?
> 
> I did not look into all the details ...
> 
> In general: Keep the API, but add stubs with minimal operation or
> fail for none C locale (etc.).
> 
> >- UTF-8 encoding and decoding
> 
> May be of use to keep, if on bare minimum.

This is roughly 3k of code, and is mandatory if you want to say you
"support UTF-8" at all. I'll note the other parts that fundamentally
depend on it.

> >- Character properties
> > - Case mappings
> 
> Keep ASCII, map all none ASCII to a single value.

I assume by "map to a single value" you mean uniform properties for
all non-ASCII Unicode characters, e.g. just printable but nothing
else. Case-mapping everything down to one character would not be a
good idea. :-)

Character properties are roughly 11k of code. Case mappings are 1k of
code.

Note that while some of the properties are arguably not very useful
(the wctype system does not give you enough information to do serious
text processing with them), without the wcwidth property, you cannot
properly display non-ASCII text on a terminal. So at least this one,
which takes 3k, is pretty critical to "UTF-8 support".

> >- Internal message translation (nl_langinfo strings, errors, etc.)
> > - Message translation API (gettext)
> 
> No translation at all, keep the English messages (as short as possible).

The internal translation support is about 2k. The gettext system is
roughly another 2k on top of that (and depends on the former).

I agree this is completely non-mandatory for "UTF-8 support" and
that's why musl originally didn't have it.

> >- Charset conversion (iconv)
> 
> Copy ASCII / UTF-8, but fail for all other.

iconv is big. About 128k. The ability to selectively omit some or all
legacy charsets from iconv is a long-term goal.

Of course if you have an actual need for character set conversion,
e.g. reading email in mutt, then your alternative to musl's 128k iconv
is GNU libiconv weighing in at several MB...

> >- Non-ASCII characters in regex and fnmatch patterns/brackers
> 
> May be the question to allow for UTF-8, but only those, no other
> charsets (should allow to do some optimization and avoid all the
> extended overhead).

That's how it is now.

> fnmatch: Match None ASCII just 1:1, no other special operation.
> 
> regex: Don't have the experience on the internals of this topic. In
> general allow for 1:1 matching of none ASCII characters, but
> otherwise behave as C locale (e.g. equivalence classes).

For both fnmatch and regex, the single-character-match (? or .
respectively) matches characters, not bytes. Likewise bracket
expressions match characters. In order for this to work at all, you
need UTF-8 decoding (see above).

There's no directly measurable code size cost for these items; the
savings from not doing UTF-8 would come from completely different code
that doesn't now exist in musl for bypassing mbtowc and just working
directly on input bytes.

So aside from iconv, the above seem to total around 19k, and at least
6k of that is mandatory if you want to be able to claim to support
UTF-8. So the topic at hand seems to be whether you can save <13k of
libc.so size by hacking out character handling/locale related features
that are non-essential to basic UTF-8 support...

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.