musl - Re: [PATCH v1] wctype: reduce text size of iswalpha & iswpunct by 53%

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAO6moYvAV=zpexNzA9hVW582n866ntJZBhZk9Fn782qiNgYYiA@mail.gmail.com>
Date: Wed, 21 Jan 2026 00:40:58 +1100
From: Xan Phung <xan.phung@...il.com>
To: Xan Phung <xan.phung@...il.com>, Openwall musl <musl@...ts.openwall.com>
Subject: Re: [PATCH v1] wctype: reduce text size of iswalpha & iswpunct
 by 53%

On Tue, 20 Jan 2026 at 23:36, Szabolcs Nagy <nsz@...t70.net> wrote:

> * Xan Phung <xan.phung@...il.com> [2026-01-20 05:25:16 +0800]:
> > On Tue, 20 Jan 2026, 3:55 am Szabolcs Nagy, <nsz@...t70.net> wrote:
> > > note that decimal '255' is shorter than '0xff' so decimal
> > > means a bit less data in the source code.
> > >
> >
> >
> > Ha, ha, ha :)  Not sure if you are joking, but when I was referring to
> the
> > 'text size' I mean the compiled code and constant data size reported by
> the
> > 'size' tool, not the source code size.
> >
> > But if you are saying musl coding standards prefer decimal due to it's
> more
> > compact source code data size, then yes I can make decimal source files
> > instead of hexadecimal.
>
> this is a minor comment,
> but the source code size matters too as musl is distributed in
> source form too. decimal data likely deflate compresses better
> so this affects git history size too. obviously for actual code
> the readability is more important, but for data there is no
> such constraint thus we can choose, so why use hex?
>
>
Thanks for the clarification.
I am happy to do whichever number base best fits with musl's conventions.


> more importantly this code looks endian dependent:
>
> > > > +const static unsigned char table[PAGEH + 227*sizeof(unsigned)] = {
> > > > +#include "iswalpha_table.h"
> > > >  };
>
> byte order matters as table data is not equal per 4 byte chunks:
>
>
Yes, good point.
What I can do is for the bytes that need to be grouped as words, I can
surround them with a macro.
The header will look something like:

WORD(0x55,0x25,0x55,0x55),0x46,0x01,0x3f,0x7f,0xd5,0x2d,0x00,0x00,

The macro WORD will permute the bytes as needed for the machine's
endianness.
(Or does musl already have a macro that does this?)

...
>
> +0xec,0xc7,0x3d,0xd6,0x18,0xc7,0xff,0xc3,0xc7,0x1d,0x81,0x00,0xc0,0xff,0x00,0x00,
> +0x00,0x00,0x00,0x00,0x55,0x55,0x55,0x55,
>
> > > > +int iswalpha(wint_t wc) {
> > > > +     unsigned *huffm = (unsigned *)(table + PAGEH), target;
>
> deref of huffm is technically an aliasing violation and thus ub.
>
> > > > +     huff = huffm[base];
> > > > +     base+= (rev = -(page & 1)) + 1;
> > > > +     type = (huff >> (2 * lane)) & 3;
> > > > +     popc = (huff << (31 - 2 * lane));
>
> in such cases i think the cleanest solution is
>
> static const struct {
> unsigned char tab1[PAGEH];
> unsigned tab2[227];
> } data = {...};
>
the struct helps on targets where computing the
> address of a global takes multiple instructions.
> (otherwise you could just use two separate tabs)
>

Thanks. I will take up your suggestion.

If the WORD macro for word ordering of bytes above is acceptable to you, I
will go ahead and prepare a new patch.

This new patch will also contain an expansion of the iswalpha 'direct'
bitmap to cover the entire first 0x1000 codepoints (which will increase
data size by ~200 bytes, as my 1st patch only used it for a small 0x200
code point range).

This will be done in a highly performant way, such that for 220 out of 256
BMP code pages, the performance is *better* than the status quo musl code
(as it will only be a single level memory access). The remaining 36 BMP
code pages are 'niche' languages (eg. Unified Canadian Aboriginal, Vai,
Khmer, etc) or technical & symbology blocks.  Such niche code pages are
very branch predictor friendly (eg. if a text string is in the Khmer
language, it is likely that branch prediction will rapidly retrain to
correctly predict not to take the 'direct' path).  As a preview of how
performant the code is, the critical path involves (i) computing address &
index of table[] (3 cycles, including a conditional select), (ii) memory
load (3-4 cycles), (iii) shift by the amount '(wc & -direct & 7)', which
only depends on 'wc' and hence can be pre-calculated before load is
completed - so the shift + mask will only take 2 cycles, once data is
retrieved from memory:

/* Direct path used in all but 36 of the 256 BMP code pages */
direct = wc < 0x1000;
page = (wc >> PAGE_SH);
bmap = (wc >> 3) + PAGES;
base = table[direct ? bmap:page];
if (base <= 1 | direct)
    return base >> (wc & -direct & 7) & 1;

/* 2nd and 3rd level array access are the same as before ... */

Content of type "text/html" skipped
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.