Message-ID: <CAO6moYu2BqWWyvpTvheEaaup-c0piP3TV-0udHpKa9rzgD5r-w@mail.gmail.com>
Date: Tue, 20 Jan 2026 05:25:16 +0800
From: Xan Phung <xan.phung@...il.com>
To: Xan Phung <xan.phung@...il.com>, Openwall musl <musl@...ts.openwall.com>
Subject: Re: [PATCH v1] wctype: reduce text size of iswalpha & iswpunct
 by 53%

On Tue, 20 Jan 2026, 3:55 am Szabolcs Nagy, <nsz@...t70.net> wrote:

> * Xan Phung <xan.phung@...il.com> [2026-01-20 01:35:26 +1100]:
> > Currently iswalpha and iswpunct have total text data size of over
> > 8kb. A more efficient encoding has reduced the total size to 3.75kb
> > (2kb and 1.8kb respectively), a 53% reduction.
> >
> > The new encoding remains a (mostly) branchless table lookup, but now
> > requires 3 memory accesses instead of 2. It remains optimized for
> > random access decoding. The top level remains much the same,
> > providing 8 bit offsets into codepage units (256 codepoint
> > granularity). The second level data uses fixed sizes, of one 32 bit
> > word per codepage (where each 2 bit pair in word identifies a block
> > of 16 codepoints as all 0, all 1, or mixed). The third level is a
> > variable length series of extension bytes, indexed by the popcount
> > of set high bits within the second level's 32 bit word. This
> > popcount is calculated with nearly same latency as a 32 bit multiply
> > (so it is comparable with the indexing speed of accessing a 2D array
> > of non power of 2 size).
>
> nice.
>
> >
> > Results have been tested against first 0x20000 codepoints, and match
> > that returned by the pre-existing musl implementation.
>
> did you run benchmarks too? given that the new code
> uses more operations it would be useful to know how
> it performs.


The new code uses more operations (around 10 extra CPU cycles compared to
the old code), but it has a much smaller cache footprint. For example, a
cache line can hold twice as many codepoints with the new encoding as with
the old one, which means up to a 50% reduction in cache misses (and each L1
cache miss costs a 20-200 cycle penalty).
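
For a concrete number, the kind of micro-benchmark I have in mind is a tight
loop over a buffer of codepoints, along the lines of the sketch below (the
random codepoints are only a stand-in for a real text trace, so this leans
towards the cache-unfriendly end):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <wctype.h>

int main(void)
{
	enum { N = 1 << 20 };
	static wint_t buf[N];
	unsigned long hits = 0;
	struct timespec t0, t1;

	/* pseudo-random codepoints below 0x20000, the range the tables cover */
	for (int i = 0; i < N; i++)
		buf[i] = (wint_t)(rand() % 0x20000);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < N; i++)
		hits += !!iswalpha(buf[i]);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%lu alphabetic, %.2f ns/call\n", hits, ns/N);
	return 0;
}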

I can add benchmarking output to the tool I use to generate and verify the
data file, but before I do that, I want to get input on:

(i) What the preferred trade-off between performance and data size is. For
example, expanding the one level direct bitmap (currently used in iswalpha
only for the difficult to compress codepoints in the 0xa00 to 0xc00 range)
to cover the whole 0x0 to 0x1000 range adds around 300 bytes to iswalpha's
data size, but removes up to 2 levels of memory access for those codepoints
(which include the most commonly used ones), while still giving a 40~45%
size reduction compared to the old code. A sketch of this fast path is
further below.

(ii) Whether a 64 bit word version would be accepted (currently I have used
a maximum data word size of 32 bits).

Using a 64 bit word at the 2nd indexing level would compress better because
it allows larger 512 codepoint pages, so the top level index shrinks from
512 bytes to 256 bytes; the net saving is up to 120-150 bytes after some
losses due to alignment wastage etc.
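
To spell out the arithmetic (the PAGE_SH64/PAGES64 names below are
hypothetical, just following the style of the patch's existing constants):

#define PAGE_SH64  9                        /* 512 codepoints per page       */
#define PAGES64    (0x20000 >> PAGE_SH64)   /* = 256 pages -> 256 byte index */
#define PAGES32    (0x20000 >> 8)           /* = 512 pages -> 512 byte index */

Halving the page count halves the one byte per page top level index.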

A 64b word size also enables a potential performance optimisation: the 3rd
level data can be loaded in parallel with the 2nd level data (effectively
collapsing the table hierarchy back to 2 levels), with the popcount then
used as the shift amount for a 64b barrel shift rather than as a byte index
for a 3rd level memory load.
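
Very roughly, the shape I have in mind is the following (the names, the word
layout and the use of __builtin_popcountll are all hypothetical, not the
posted patch, and it assumes at most 8 "mixed" blocks per 512 codepoint page
so that their extension bytes fit in one word):

#include <stdint.h>

/* For each 512 codepoint page, two adjacent 64 bit words are fetched
 * together: 'huff' holds 32 x 2 bit block types, 'extw' packs that page's
 * extension bytes.  The popcount of the high bits of the 2 bit pairs below
 * the target block then selects a shift into 'extw' instead of acting as a
 * byte index for a third memory load. */
static unsigned char ext_byte(const uint64_t *page_words, unsigned target)
{
	uint64_t huff = page_words[0];
	uint64_t extw = page_words[1];          /* loads issue back to back */
	unsigned lane = (target & 511) >> 4;    /* which 16 codepoint block */
	uint64_t high = huff & 0xaaaaaaaaaaaaaaaaULL;
	unsigned popc = __builtin_popcountll(high & ((1ULL << (2*lane)) - 1));

	return extw >> (8*popc);                /* barrel shift, no 3rd load */
}

The per codepoint decode from the extension byte would then stay the same as
in the 32 bit version.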

Hashing (which could reduce the memory access hierarchy to as little as one
load) also becomes viable at a 64b word size, but it comes in at
approximately 10% larger data size.
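
Going back to (i), the expanded fast path would look roughly like this (a
sketch only: the table contents, the packed_lookup() stub and the function
name are placeholders, not the posted patch):

#include <wctype.h>

/* One level direct bitmap for all of 0x0-0x1000: 0x1000/8 = 512 bytes of
 * data, checked before the packed three level lookup.  The ~300 byte
 * figure in (i) is the net growth once the pages this range covers drop
 * out of the packed tables. */
static const unsigned char direct[0x1000/8]; /* generated, 1 bit/codepoint */

static int packed_lookup(wint_t wc)
{
	return 0; /* stand-in for the three level lookup */
}

int iswalpha_sketch(wint_t wc)
{
	if ((unsigned)wc < 0x1000)                  /* the hot scripts */
		return direct[wc>>3] >> (wc&7) & 1; /* one load, no index chain */
	return packed_lookup(wc);
}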


>
> note that decimal '255' is shorter than '0xff' so decimal
> means a bit less data in the source code.
>


Ha, ha, ha :)  Not sure if you are joking, but by 'text size' I meant the
compiled code and constant data size reported by the 'size' tool, not the
source code size.

But if you are saying the musl coding standards prefer decimal because of
its more compact source, then yes, I can generate decimal source files
instead of hexadecimal.



>
>
> > --- a/src/ctype/iswalpha.c
> > +++ b/src/ctype/iswalpha.c
> > @@ -1,16 +1,54 @@
> >  #include <wctype.h>
> >
> > -static const unsigned char table[] = {
> > -#include "alpha.h"
> > +#define PAGE_SH   8
> > +#define PAGE_MAX  (1u << PAGE_SH)
> > +#define PAGES     (0x20000 / PAGE_MAX)
> > +#define PAGEH     (0x200/8 + PAGES)    /* 0x200 direct mapped codepts */
> > +
> > +const static unsigned char table[PAGEH + 227*sizeof(unsigned)] = {
> > +#include "iswalpha_table.h"
> >  };
> >
> > -int iswalpha(wint_t wc)
> > -{
> > -     if (wc<0x20000U)
> > -             return (table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1;
> > -     if (wc<0x2fffeU)
> > -             return 1;
> > -     return 0;
> > +const static unsigned short dict[119] = {
> > +#include "iswalpha_dict.h"
> > +};
> > +
> > +int iswalpha(wint_t wc) {
> > +     unsigned *huffm = (unsigned *)(table + PAGEH), target;
> > +     unsigned page, shfr, lane, base, rev;
> > +     unsigned huff, type, popc, fast;
> > +     signed char ext;
> > +
> > +     /* Uncommon codepoints, skipped by branch predictors */
> > +     if ((unsigned)wc >= 0x20000)
> > +             return (unsigned) wc < 0x2fffe;
> > +     if ((unsigned)wc-0xa00 < 0x200)
> > +             return table[PAGES + ((wc-0xa00)>>3)] >> (wc & 7) & 1;
> > +
> > +     /* Three level lookup, final level index = popc^rev_direction */
> > +     target = wc & (PAGE_MAX-1);
> > +     page = wc >> PAGE_SH;
> > +     shfr = target & 15;
> > +     lane = target >> 4;
> > +     base = table[page];
> > +     huff = huffm[base];
> > +     base+= (rev = -(page & 1)) + 1;
> > +     type = (huff >> (2 * lane)) & 3;
> > +     popc = (huff << (31 - 2 * lane));
> > +     popc = (popc & 0x11111111) + ((popc & 0x44444444) >> 2);
> > +     popc = (popc * 0x11111111) >> 28;
> > +     ext  = (table + PAGEH + base*4)[(int)(popc^rev)];
> > +
> > +     /* Fast (1st) path precalcs shfr before ext loaded from mem   */
> > +     /* Dictionary lookup slow (2nd) path is only 1% of codepoints */
> > +     fast = (type != 2 | ext & 1);
> > +     if (fast) {
> > +             shfr = (shfr + 5) & -(type >= 2);
> > +             shfr = (shfr - (6 & -(type == 2)) + type);
> > +             return (ext << 8 | 0xfe) >> shfr & 1;
> > +     } else {
> > +             return dict[ext >> 1 & 0x7f] >> shfr & 1;
> > +     }
> >  }
>
