musl - Re: Build option to disable locale [was: Byte-based C locale, draft 1]



Message-ID: <CAMAJcuC62F8jp76XFac-z505HUnJDkxFBzSw1D+VKK+0ws3Fxw@mail.gmail.com>
Date: Sun, 7 Jun 2015 19:28:51 -0500
From: Josiah Worcester <josiahw@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale,
 draft 1]

On Sun, Jun 7, 2015 at 6:59 PM, Harald Becker <ralda@....de> wrote:
> On 07.06.2015 02:24, Rich Felker wrote:
>>
>> It's somewhat more clear what you're talking about, but I'm still not
>> sure what specific pieces of code you would want to omit from libc.so.
>> Which of the following would you want to remove or keep?
>
>
> I did not look into all the details ...
>

To start with: keep in mind that with static linking, most of this
code is not pulled in at all unless the program actually uses it.
Static linking might be a better fit for your needs.

> In general: Keep the API, but add stubs with minimal operation or fail for
> non-C locale (etc.).
>
>> - UTF-8 encoding and decoding
>
>
> May be of use to keep, if on bare minimum.

Seeing as the UTF-8 decoder is very small already, I'd be shocked if
you could make an argument for removing that.
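
To give a sense of scale, here is a minimal sketch of a strict UTF-8
decoder. This is not musl's code: utf8_decode is a hypothetical helper
written only for illustration, and its interface (bytes consumed on
success, 0 on invalid or truncated input) is an assumption of this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT musl's implementation: decode one UTF-8
 * sequence from s (at most n bytes) into *out. Returns the number of
 * bytes consumed, or 0 if the input is invalid or truncated. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (n == 0) return 0;

    unsigned char b = s[0];
    size_t len;
    uint32_t cp, min;

    if (b < 0x80) { *out = b; return 1; }   /* plain ASCII */
    else if (b < 0xC2) return 0;            /* stray continuation or overlong lead */
    else if (b < 0xE0) { len = 2; cp = b & 0x1F; min = 0x80; }
    else if (b < 0xF0) { len = 3; cp = b & 0x0F; min = 0x800; }
    else if (b < 0xF5) { len = 4; cp = b & 0x07; min = 0x10000; }
    else return 0;                          /* lead byte beyond U+10FFFF range */

    if (n < len) return 0;                  /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min) return 0;                      /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogates */
    if (cp > 0x10FFFF) return 0;                 /* outside Unicode */

    *out = cp;
    return len;
}
```

Calling it on the two bytes 0xC3 0xA9 yields U+00E9 and consumes both
bytes; an overlong sequence such as 0xC0 0x80 is rejected.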

>> - Character properties
>
>> - Case mappings
>
> Keep ASCII, map all non-ASCII to a single value.

This would not be quite right. Also, the case mapping tables are quite
small: towctrans.lo, which contains the case mappings, is 1106 bytes.

>> - Internal message translation (nl_langinfo strings, errors, etc.)
>
>> - Message translation API (gettext)
>
> No translation at all, keep the English messages (as short as possible).

musl does not contain any translations at all. It has only a small
piece of logic for loading external translations. locale_map.lo and
__mo_lookup.lo, which together are responsible for this, total
1471 bytes.

>> - Charset conversion (iconv)
>
>
> Copy ASCII / UTF-8, but fail for all other.

Though quite possible, it's worth noting that musl iconv is not very
large. iconv.lo is 128408 bytes, or about 125k.

>> - Non-ASCII characters in regex and fnmatch patterns/brackets
>
>
> May be the question to allow for UTF-8, but only those, no other charsets
> (should allow to do some optimization and avoid all the extended overhead).

This is already the case.

> fnmatch: Match non-ASCII just 1:1, no other special operation.

fnmatch.lo itself is 2227 bytes right now, and none of that is UTF-8
handling. The bulk of that handling is in mbtowc.lo and mbsrtowcs.lo,
which are 227 bytes and 636 bytes respectively.

> regex: Don't have the experience on the internals of this topic. In general
> allow for 1:1 matching of non-ASCII characters, but otherwise behave as C
> locale (e.g. equivalence classes).
>

The regex equivalence classes are handled via the isw* functions which
(as mentioned above) are quite small.

In short, it seems that if we made these changes we might be able to
trim about 135k, and almost all of that would be in iconv. Though I
appreciate the desire for smaller code, this doesn't quite seem like
the place to go looking.
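
As a quick sanity check on that estimate, the sizes quoted in this
message can be tallied as in the sketch below. The struct and function
names are made up for the example; only the byte counts come from the
figures above. The sum is an upper bound, since fnmatch.lo, mbtowc.lo
and mbsrtowcs.lo would not disappear entirely, and no sizes were quoted
for the isw* functions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative type, not part of musl. */
struct obj_size {
    const char *name;
    unsigned long bytes;
};

/* Object sizes exactly as quoted in this message. */
const struct obj_size quoted_sizes[] = {
    { "towctrans.lo",                   1106 },
    { "locale_map.lo + __mo_lookup.lo", 1471 },
    { "iconv.lo",                       128408 },
    { "fnmatch.lo",                     2227 },
    { "mbtowc.lo",                      227 },
    { "mbsrtowcs.lo",                   636 },
};
const size_t quoted_count = sizeof quoted_sizes / sizeof quoted_sizes[0];

/* Sum the byte counts of n entries. */
unsigned long total_bytes(const struct obj_size *v, size_t n)
{
    assert(v != NULL || n == 0);
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i].bytes;
    return sum;
}

/* Whole-number percentage of the total taken by entry idx (rounded down). */
unsigned long share_percent(const struct obj_size *v, size_t n, size_t idx)
{
    assert(idx < n);
    unsigned long total = total_bytes(v, n);
    return total ? v[idx].bytes * 100 / total : 0;
}
```

With these figures the total is 134075 bytes (roughly 131k), and
iconv.lo accounts for about 95% of it, which is consistent with the
rough 135k figure above.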
