musl - Re: Build option to disable locale [was: Byte-based C locale, draft 1]

Message-ID: <55750D9B.3010002@gmx.de>
Date: Mon, 08 Jun 2015 05:35:55 +0200
From: Harald Becker <ralda@....de>
To: musl@...ts.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale,
 draft 1]

On 08.06.2015 04:36, Rich Felker wrote:
> Do you have an application in mind where saving ~13k in libc.so would
> make the difference between being able to use it or not?

I think I answered this already in the other message.

> Maintaining useless configuration knobs and having to test them all is
> also too much work, and the main reason why uClibc is dying.

Too many knobs make a library difficult to maintain; that is always a
problem. But without any control you get a monolithic hunk which may
not fit all needs. That is why I am looking for a knob to remove all
the char set and locale handling, dropping to a C locale only ... with
some clever optimization for dedicated UTF-8 handling.

> That's how it always has been. You could actually make it smaller by
> writing the string functions as trivial character-at-a-time loops on
> top of mbtowc/wctomb, but you'd be sacrificing performance to save a
> few hundred bytes at most. This does not seem like a worthwhile
> trade-off.

Rich, UTF-8 allows character operations without converting the
individual sequences back and forth. I never use mbtowc/wctomb, or any
other wide character function. There are only a few operations which
really need to work on Unicode code point values (not characters).
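
A minimal sketch of such an operation, assuming the input is valid
UTF-8. The function name utf8_count is hypothetical and not part of
musl. It counts code points by looking only at the bit pattern of each
byte, without decoding anything into a wide character:

```c
#include <assert.h>
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string.
 * Every code point has exactly one byte that is not a continuation
 * byte (10xxxxxx), so counting those bytes is enough.
 * Assumption: s holds valid UTF-8; no validation is done here. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Called on the three bytes "a", 0xC3, 0xA4 it returns 2, since 0xA4 is a
continuation byte. Note that this counts code points, not displayed
characters.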

> musl's "C" locale is _not_ ASCII but is UTF-8. Supporting both
> byte-at-a-time operation in the C locale and proper UTF-8 handling
> otherwise is the topic of this thread, and it very mildly increases
> code size.

So this is why I jumped in on that topic. You get close to what I want
when you read my request as: "Create a lib with that single C locale
handling, without any other locale or char set stuff".

... but I think you still make some UTF-8 operations too complicated.
It looks to me like you are fixating on the wide character and
multibyte parts, which I consider unnecessary. UTF-8 is a sequence of
bytes, so just keep it as such a sequence. Most string operations allow
for clever optimization.
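
One example of a string operation that can stay purely byte-based is
substring search. In valid UTF-8, lead bytes and continuation bytes use
disjoint value ranges, so a byte-wise match of a valid needle in a
valid haystack always starts and ends on a character boundary. The
sketch below relies on that property and on the standard strstr; the
wrapper name utf8_find is hypothetical:

```c
#include <assert.h>
#include <string.h>

/* Return the byte offset of needle inside haystack, or -1 if it does
 * not occur. Assumption: both strings are valid UTF-8, so the plain
 * byte comparison done by strstr cannot match in the middle of a
 * multibyte sequence. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *p;

    assert(haystack != NULL && needle != NULL);
    p = strstr(haystack, needle);
    return p != NULL ? (long)(p - haystack) : -1;
}
```

The result is a byte offset, not a character index, which is usually
what the caller needs for further byte-based processing.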

> This probably could be a configurable option at some point, but it's
> not going to save you any meaningful amount of space. Best case would
> be saving about 4k.

... another 4k of unnecessary code.

> 119k of the 128k for iconv is legacy CJK character set tables. The
> code for them is only 1k or so. Legacy 8bit codepages are another 6k.
> The easy way to make things optional here would be just dropping
> tables for configured-out charsets.

Kick them all out: just ASCII, UTF-8, and 1:1 copy operation.

> "Don't break" means handling UTF-8.

"Don't break" means clever handling of embedded UTF-8 sequences, without
conversion of every sequence to wide char values, or similar.

> Is this a practical need or an ideological one?

I consider this a practical need, as I like to set up highly
specialized chroot environments (isolating the applications to their
controlled set of accessible data, like virtual hosts).

> Your goal keeps shifting back between supporting UTF-8 and "passing
> through sequences" which is _NOT_ supporting UTF-8. If I accidentally
> type a non-ASCII character on a command line then press backspace and
> the character disappears but there are still two hidden junk bytes on
> the command line, that is a broken system.

That would be a broken input system or read line system. In cooked mode
you won't get that either, and in raw mode you need to, and can, handle
that with some simple checks. So you don't need the full multibyte and
wide character handling to handle UTF-8.
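
A minimal sketch of such a check for backspace in a raw mode line
editor, assuming the buffer holds valid UTF-8. The name utf8_backspace
is hypothetical. It only decides how many bytes to drop from the
buffer; how many columns to erase on screen is a separate question that
this sketch does not address:

```c
#include <assert.h>
#include <stddef.h>

/* Return the new length of buf after deleting its last code point.
 * Trailing continuation bytes (10xxxxxx) are stripped first, then the
 * lead byte (or single ASCII byte) in front of them.
 * Assumption: buf[0..len) is valid UTF-8. */
size_t utf8_backspace(const char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0 && ((unsigned char)buf[len - 1] & 0xC0) == 0x80)
        len--;          /* drop continuation bytes */
    if (len > 0)
        len--;          /* drop the lead byte */
    return len;
}
```

With this, deleting a two-byte character removes both bytes, so no
hidden junk bytes stay behind in the buffer.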

> by with just treating text as abstract byte strings that can pass
> through anything, but for any kind of visual presentation (even on a
> terminal) or entry/editing, you need to know character identity.

Do you? I think not. The terminal program (or font handling part) may
need this. Even on a simple textual Linux console, the kernel knows how
to display the character. The only thing you need to know is the
correct number of bytes to send for a specific character.
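
The number of bytes in a sequence follows from its first byte alone. A
sketch under the assumption that the ranges of RFC 3629 apply (lead
bytes 0xC2..0xF4); the name utf8_seq_len is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Return the length in bytes of the UTF-8 sequence starting at p,
 * judged from the lead byte only, or 0 if *p cannot start a sequence
 * (a continuation byte or a value never used as a lead byte).
 * The continuation bytes themselves are not checked here. */
size_t utf8_seq_len(const char *p)
{
    unsigned char lead;

    assert(p != NULL);
    lead = (unsigned char)*p;
    if (lead < 0x80)
        return 1;                       /* ASCII, including NUL */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;
    return 0;                           /* 0x80..0xC1, 0xF5..0xFF */
}
```

A caller can use the result to write one whole character at a time, or
treat a return value of 0 as a byte to be passed through or rejected.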

> Fortunately, as I've been trying to say, that's extremely cheap. You
> could save more space by making the string functions or qsort or
> something naive and slow...

Do they need to be slow? UTF-8 can be handled without the full wide
character and multibyte stuff you are throwing in, and I want to get
bare ASCII operation plus this simple UTF-8 handling. Mostly musl does
this, but it adds some other char set and locale handling stuff, which
I would like to opt out of. With static linking that is easy, but I
would like to get a shared library without all that extra stuff.

--
Harald