Date: Sun, 7 Jun 2015 22:51:21 -0500
From: Josiah Worcester <josiahw@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale,
 draft 1]

On Sun, Jun 7, 2015 at 10:35 PM, Harald Becker <ralda@....de> wrote:
> On 08.06.2015 04:36, Rich Felker wrote:
>>
>> Do you have an application in mind where saving ~13k in libc.so would
>> make the difference between being able to use it or not?
>
>
> I think I answered this already in the other message.
>
>> Maintaining useless configuration knobs and having to test them all is
>> also too much work, and the main reason why uClibc is dying.
>
>
> Too many knobs make maintenance difficult, that is always a problem, but
> without configurability you end up with a monolithic hunk which may not fit
> all needs. That is why I am looking for a knob to remove all the charset and
> locale handling, dropping down to a C locale only ... with some clever
> optimization for dedicated UTF-8 handling.
>

The point is that "all the charset and locale handling" is minuscule,
spread out, and not easily removed. It's also only UTF-8 handling to
begin with.

>> That's how it always has been. You could actually make it smaller by
>> writing the string functions as trivial character-at-a-time loops on
>> top of mbtowc/wctomb, but you'd be sacrificing performance to save a
>> few hundred bytes at most. This does not seem like a worthwhile
>> trade-off.
>
>
> Rich, UTF-8 allows for character operations without converting the
> individual sequences forth and back. I never need mbtowc/wctomb, or even
> any other wide character function. There are only a few operations which
> really need to work on Unicode code point values (not characters).

You need to be incredibly careful and essentially parse straight to
codepoints in order not to accept invalid UTF-8. Most notably, overlong
sequences and surrogates must not be accepted, and you cannot recognize
those without actually decoding the sequence.
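To illustrate why, here is a minimal decoder sketch in C (hypothetical code, not musl's actual implementation): rejecting overlong forms and surrogates requires computing the codepoint value, so a validator ends up doing the full decoding anyway.

```c
#include <stddef.h>

/* Hypothetical sketch, not musl code: decode one UTF-8 sequence from a
 * NUL-terminated string, storing the sequence length in *len. Returns
 * the codepoint, or -1 for invalid input (bad lead/continuation bytes,
 * overlong forms, surrogates, values above U+10FFFF). A NUL byte never
 * passes the continuation-byte checks, so truncated input is safe. */
static long utf8_decode(const unsigned char *s, size_t *len)
{
    if (s[0] < 0x80) { *len = 1; return s[0]; }              /* ASCII */
    if ((s[0] & 0xe0) == 0xc0) {                             /* 2 bytes */
        if ((s[1] & 0xc0) != 0x80) return -1;
        long c = ((long)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
        if (c < 0x80) return -1;                             /* overlong */
        *len = 2; return c;
    }
    if ((s[0] & 0xf0) == 0xe0) {                             /* 3 bytes */
        if ((s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80) return -1;
        long c = ((long)(s[0] & 0x0f) << 12)
               | ((long)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
        if (c < 0x800) return -1;                            /* overlong */
        if (c >= 0xd800 && c <= 0xdfff) return -1;           /* surrogate */
        *len = 3; return c;
    }
    if ((s[0] & 0xf8) == 0xf0) {                             /* 4 bytes */
        if ((s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80
         || (s[3] & 0xc0) != 0x80) return -1;
        long c = ((long)(s[0] & 0x07) << 18) | ((long)(s[1] & 0x3f) << 12)
               | ((long)(s[2] & 0x3f) << 6) | (s[3] & 0x3f);
        if (c < 0x10000 || c > 0x10ffff) return -1;  /* overlong / range */
        *len = 4; return c;
    }
    return -1;   /* stray continuation byte or invalid lead byte */
}
```

Note that the overlong check is just a comparison on the decoded value: "pass the bytes through" schemes have no place to put that comparison.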

>> musl's "C" locale is _not_ ASCII but is UTF-8. Supporting both
>> byte-at-a-time operation in the C locale and proper UTF-8 handling
>> otherwise is the topic of this thread, and it very mildly increases
>> code size.
>
>
> So this is why I jumped in on this topic. You get approximately what I
> would like when you read my request as: "Create a lib with that single C
> locale handling, without any other locale or charset stuff".
>
> ... but I think you still make some UTF-8 operations too complicated. It
> looks to me like you are fixating on the wide character and multibyte
> parts, which I consider not to be required. UTF-8 is a sequence of bytes,
> so just keep it as that sequence. Most string operations allow for clever
> optimization.

Where it's reasonable, musl does operate directly on byte sequences. But
there are operations that only make sense on units of codepoints.
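Case conversion is one such operation: it is defined on codepoints, not bytes, so even a "byte string" view of UTF-8 has to decode before it can map cases. A sketch under my own naming (not musl's API), assuming a UTF-8 locale set with setlocale():

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical sketch, not musl code: uppercase a UTF-8 string.
 * Assumes dst is large enough for the result (case mapping can change
 * the byte length of a character). Returns 0 on success, -1 on
 * invalid input. */
static int utf8_toupper(char *dst, const char *src)
{
    wchar_t wc;
    while (*src) {
        int n = mbtowc(&wc, src, MB_CUR_MAX);   /* decode one codepoint */
        if (n < 1) return -1;                   /* invalid sequence */
        src += n;
        int m = wctomb(dst, (wchar_t)towupper((wint_t)wc)); /* re-encode */
        if (m < 1) return -1;
        dst += m;
    }
    *dst = 0;
    return 0;
}
```

The mbtowc/wctomb round trip here is not overhead to be optimized away; it is the operation itself.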

>> This probably could be a configurable option at some point, but it's
>> not going to save you any meaningful amount of space. Best case would
>> be saving about 4k.
>
> ... another 4k of unnecessary code.

4k is not exactly worth optimizing away unless there's a really compelling
reason, especially when the code in question provides this much benefit.

>> 119k of the 128k for iconv is legacy CJK character set tables. The
>> code for them is only 1k or so. Legacy 8bit codepages are another 6k.
>> The easy way to make things optional here would be just dropping
>> tables for configured-out charsets.
>
>
> Kick them all out: just ASCII, UTF-8, and a 1:1 copy operation.
>
>> "Don't break" means handling UTF-8.
>
>
> "Don't break" means clever handling of embedded UTF-8 sequences, without
> conversion of every sequence to wide character values, or similar.

If you don't parse UTF-8, you accept invalid UTF-8, which is itself a
terrible bug.

>> Is this a practical need or an ideological one?
>
>
> I consider this a practical need, as I like to set up highly specialized
> chroot environments (isolating the applications to their controlled set of
> accessible data, like virtual hosts).

We are arguing over 135 kilobytes at most. Even if (*if*) you could
strip some space from libc here, I guarantee there are much more
interesting things to optimize. For instance, your kernel is several
megabytes, your applications are incredibly unlikely to be reasonably
efficient with space, etc.

As such, this is vastly more ideological (and silly) than practical,
especially as those kilobytes (kilobytes!) are very useful kilobytes.

>
>> Your goal keeps shifting back between supporting UTF-8 and "passing
>> through sequences" which is _NOT_ supporting UTF-8. If I accidentally
>> type a non-ASCII character on a command line then press backspace and
>> the character disappears but there are still two hidden junk bytes on
>> the command line, that is a broken system.
>
>
> That would be a broken input system or readline system. In cooked mode you
> won't get that either, and in raw mode you can and need to handle it with
> some simple checks. So you don't need the full multibyte and wide
> character handling to handle UTF-8.

You need to know the width of characters in units of terminal cells if
you're doing anything involving moving the cursor or editing the
screen, though...
And again, you need to parse UTF-8 in order not to accept invalid UTF-8.
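For the cursor-movement case, the needed computation can be sketched with the standard mbtowc()/wcwidth() interfaces (the function name is mine, and it assumes the process has a UTF-8 locale set via setlocale()):

```c
#define _XOPEN_SOURCE 700   /* for wcwidth() */
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical sketch: display width of a UTF-8 string in terminal
 * cells, the quantity a line editor needs before it can move the
 * cursor. Returns -1 on invalid sequences or non-printable characters. */
static int display_width(const char *s)
{
    int width = 0;
    wchar_t wc;
    while (*s) {
        int n = mbtowc(&wc, s, MB_CUR_MAX);  /* decode one codepoint */
        if (n < 1) return -1;                /* invalid sequence */
        int w = wcwidth(wc);                 /* cells: 0, 1, or 2 */
        if (w < 0) return -1;                /* non-printable */
        width += w;
        s += n;
    }
    return width;
}
```

Double-width CJK characters and zero-width combining marks are exactly why neither byte counts nor even codepoint counts are enough here.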

>> by with just treating text as abstract byte strings that can pass
>> through anything, but for any kind of visual presentation (even on a
>> terminal) or entry/editing, you need to know character identity.
>
>
> Do you? I think not. The terminal program (or font handling part) may
> need this. Even on a simple textual Linux console, the kernel knows how
> to display the character. The only thing to know is the correct number of
> bytes to send for a specific character.

And the width of the character in terminal cells.

>> Fortunately, as I've been trying to say, that's extremely cheap. You
>> could save more space by making the string functions or qsort or
>> something naive and slow...
>
>
> Do they need to be slow? UTF-8 can be handled without that full wide
> character and multibyte stuff you are throwing in, and I want to get bare
> ASCII operation plus this simple UTF-8 handling. Mostly musl does this, but
> adds in some other charset and locale handling stuff, which I would like to
> opt out of. With static linking it is easy, but I would like to get a
> shared library without all that extra stuff.

"Simple UTF-8 handling" is what we have. Have you looked at the wide
character stuff? If you start taking stuff out, you literally stop
doing UTF-8 correctly.
