musl - [PATCH] Byte-based C locale, draft 2

Message-ID: <20150613070655.GJ17573@brightrain.aerifal.cx>
Date: Sat, 13 Jun 2015 03:06:55 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: [PATCH] Byte-based C locale, draft 2

On Sat, Jun 06, 2015 at 10:50:25PM -0400, Rich Felker wrote:
> On Sat, Jun 06, 2015 at 05:40:07PM -0400, Rich Felker wrote:
> > Attached is the first draft of a proposed byte-based C locale. The
> > patch is about 400 lines but most of it is context, because it's
> > basically a lot of tiny changes spread out over lots of files.
> > [...]
> 
> If we go forward with this, I think I can factor it into 3 parts:
> 
> 1. Add checks for MB_CUR_MAX==1 and the bytelocale support they would
>    activate, and the CODEUNIT/IS_CODEUNIT macros needed for these code
>    paths. This patch would be a complete nop and would not even affect
>    codegen with a decent compiler since MB_CUR_MAX==4 is a constant
>    right now.
> 
> 2. Introduce stdio saving of active LC_CTYPE at the time of stream
>    orientation (fwide) and save/restore of current locale around stdio
>    ops that need it (fputwc, fgetwc, ungetwc) and iconv usage of
>    multibyte functions. This patch would increase code size in a few
>    places but would not change behavior.
> 
> 3. Replace the constant MB_CUR_MAX macro with a runtime-variable value
>    dependent on CURRENT_LOCALE->cat[LC_CTYPE]. This would actually
>    activate the byte-based C locale support. locale_impl.h is actually
>    already doing this, so I think I should remove that definition
>    before making any changes and only bring it back if/when stage 3
>    here is committed.
> 
> In principle stages 1 and 2 could be committed in either order;
> they're independent. Stage 3 is also independent in what it touches,
> but if it's already committed before stage 1/2, then committing stage
> 1 without stage 2 is a functional regression (stdio functions no
> longer behave according to spec; iconv stops working in C locale).

Attached is the 3-part factorization described above, as patches
against commit 536c6d5a4205e2a3f161f2983ce1e0ac3082187d.

As predicted, part 1 does not change the generated code at all, at
least for my toolchain.
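
To illustrate why part 1 can be a nop, here is a minimal sketch (not
taken from the attached patches) of a conversion path guarded by an
MB_CUR_MAX==1 check. The DEMO_ macros are hypothetical stand-ins for
CODEUNIT/IS_CODEUNIT, and the 0xDF80..0xDFFF target range is an
assumption made for the example; it sits in the low-surrogate block,
which valid UTF-8 never decodes to. The mb_cur_max parameter stands in
for MB_CUR_MAX: while that is the constant 4, a compiler can discard
the byte-based branches entirely.

```c
/* Hypothetical stand-ins for the CODEUNIT/IS_CODEUNIT macros.
 * Assumption: high bytes 0x80..0xFF map to 0xDF80..0xDFFF.
 * DEMO_CODEUNIT is only meaningful for bytes >= 0x80. */
#define DEMO_CODEUNIT(b)     (0xdf00u | (unsigned char)(b))
#define DEMO_IS_CODEUNIT(wc) ((unsigned)(wc) - 0xdf80u < 0x80u)

/* Single byte to wide character; -1 means "no single-byte mapping". */
static long demo_btowc(int c, int mb_cur_max)
{
    unsigned char b = (unsigned char)c;
    if (b < 0x80) return b;                 /* ASCII is identical in both modes */
    if (mb_cur_max == 1)                    /* byte-based path only */
        return (long)DEMO_CODEUNIT(b);
    return -1;                              /* UTF-8 mode: a lone high byte is not a character */
}

/* Wide character back to a single byte; -1 if it does not fit. */
static int demo_wctob(unsigned wc, int mb_cur_max)
{
    if (wc < 0x80) return (int)wc;
    if (mb_cur_max == 1 && DEMO_IS_CODEUNIT(wc))
        return (int)(unsigned char)wc;      /* low 8 bits recover the original byte */
    return -1;
}
```

With mb_cur_max fixed at 4, both functions reduce to the ASCII check
plus a failure return, which matches the expectation that the added
checks cost nothing until MB_CUR_MAX becomes variable.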

If nobody has further comments, I'll probably begin committing this
soon, starting with part 1, with the rest to follow as I test it more.
While in the past there were ideological objections, including my own,
all the feedback this time has come from people who want this feature
for compatibility (and future standards conformance). I think I've
managed to do it in a way that's basically cost-free and does not
compromise musl's principle that UTF-8 is first-class; it just gives
you a way (only if/when you want it) to process UTF-8 as code units
instead of codepoints.
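
For readers who want a concrete picture of parts 2 and 3, the
following is a rough sketch under stated assumptions, not the code in
the attachments. Every demo_/DEMO_ name is hypothetical, and the
convention that a null LC_CTYPE slot means "byte-based C locale" is an
assumption made for the example. It shows MB_CUR_MAX as a runtime
value derived from the current locale, and a stream that records the
locale at orientation time and temporarily restores it around a wide
operation.

```c
#include <stddef.h>

/* Hypothetical locale representation, for illustration only. */
enum { DEMO_LC_CTYPE = 0, DEMO_LC_COUNT = 6 };

struct demo_locale {
    /* Assumption: cat[DEMO_LC_CTYPE] == NULL selects the byte-based C locale. */
    const void *cat[DEMO_LC_COUNT];
};

/* Stand-in for the per-thread current locale. */
static struct demo_locale *demo_current_locale;

/* Part 3 idea: MB_CUR_MAX depends on the active LC_CTYPE at runtime. */
#define DEMO_MB_CUR_MAX (demo_current_locale->cat[DEMO_LC_CTYPE] ? 4 : 1)

/* Part 2 idea: a stream remembers the locale active when it was oriented. */
struct demo_stream {
    struct demo_locale *locale;
};

static void demo_orient(struct demo_stream *s)
{
    s->locale = demo_current_locale;
}

/* Run a wide-character operation under the stream's saved locale, then
 * restore the caller's locale. Returns the MB_CUR_MAX the operation saw. */
static int demo_wide_op(struct demo_stream *s)
{
    struct demo_locale *saved = demo_current_locale;
    demo_current_locale = s->locale;
    int seen = DEMO_MB_CUR_MAX;   /* a real implementation would convert here */
    demo_current_locale = saved;
    return seen;
}
```

The point of the save/restore is that a stream oriented under a UTF-8
locale keeps converting as UTF-8 even if the caller has since switched
to the byte-based C locale, and the caller's locale is left unchanged
afterwards.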

Rich

View attachment "bytelocale-part1.diff" of type "text/plain" (5495 bytes)

View attachment "bytelocale-part2.diff" of type "text/plain" (5126 bytes)

View attachment "bytelocale-part3.diff" of type "text/plain" (784 bytes)
