musl - Re: [PATCH v1] wctype: reduce text size of iswalpha & iswpunct by 53%

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260119195458.GX3520958@port70.net>
Date: Mon, 19 Jan 2026 20:54:58 +0100
From: Szabolcs Nagy <nsz@...t70.net>
To: Xan Phung <xan.phung@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: [PATCH v1] wctype: reduce text size of iswalpha & iswpunct
 by 53%

* Xan Phung <xan.phung@...il.com> [2026-01-20 01:35:26 +1100]:
> Currently iswalpha and iswpunct have total text data size of over
> 8kb. A more efficient encoding has reduced the total size to 3.75kb
> (2kb and 1.8kb respectively), a 53% reduction.
> 
> The new encoding remains a (mostly) branchless table lookup, but now
> requires 3 memory accesses instead of 2. It remains optimized for
> random access decoding. The top level remains much the same,
> providing 8 bit offsets into codepage units (256 codepoint
> granularity). The second level data uses fixed sizes, of one 32 bit
> word per codepage (where each 2 bit pair in word identifies a block
> of 16 codepoints as all 0, all 1, or mixed). The third level is a
> variable length series of extension bytes, indexed by the popcount
> of set high bits within the second level's 32 bit word. This
> popcount is calculated with nearly same latency as a 32 bit multiply
> (so it is comparable with the indexing speed of accessing a 2D array
> of non power of 2 size).

nice.

> 
> Results have been tested against first 0x20000 codepoints, and match
> that returned by the pre-existing musl implementation.

did you run benchmarks too? given that the new code
uses more operations it would be useful to know how
it performs.

note that decimal '255' is shorter than '0xff' so decimal
means a bit less data in the source code.

> --- a/src/ctype/iswalpha.c
> +++ b/src/ctype/iswalpha.c
> @@ -1,16 +1,54 @@
>  #include <wctype.h>
>  
> -static const unsigned char table[] = {
> -#include "alpha.h"
> +#define PAGE_SH   8
> +#define PAGE_MAX  (1u << PAGE_SH)
> +#define PAGES     (0x20000 / PAGE_MAX)
> +#define PAGEH     (0x200/8 + PAGES)    /* 0x200 direct mapped codepts */
> +
> +const static unsigned char table[PAGEH + 227*sizeof(unsigned)] = {
> +#include "iswalpha_table.h"
>  };
>  
> -int iswalpha(wint_t wc)
> -{
> -	if (wc<0x20000U)
> -		return (table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1;
> -	if (wc<0x2fffeU)
> -		return 1;
> -	return 0;
> +const static unsigned short dict[119] = {
> +#include "iswalpha_dict.h"
> +};
> +
> +int iswalpha(wint_t wc) {
> +	unsigned *huffm = (unsigned *)(table + PAGEH), target;
> +	unsigned page, shfr, lane, base, rev;
> +	unsigned huff, type, popc, fast;
> +	signed char ext;
> +
> +	/* Uncommon codepoints, skipped by branch predictors */
> +	if ((unsigned)wc >= 0x20000)
> +		return (unsigned) wc < 0x2fffe;
> +	if ((unsigned)wc-0xa00 < 0x200)
> +		return table[PAGES + ((wc-0xa00)>>3)] >> (wc & 7) & 1;
> +
> +	/* Three level lookup, final level index = popc^rev_direction */
> +	target = wc & (PAGE_MAX-1);
> +	page = wc >> PAGE_SH;
> +	shfr = target & 15;
> +	lane = target >> 4;
> +	base = table[page];
> +	huff = huffm[base];
> +	base+= (rev = -(page & 1)) + 1;
> +	type = (huff >> (2 * lane)) & 3;
> +	popc = (huff << (31 - 2 * lane));
> +	popc = (popc & 0x11111111) + ((popc & 0x44444444) >> 2);
> +	popc = (popc * 0x11111111) >> 28;
> +	ext  = (table + PAGEH + base*4)[(int)(popc^rev)];
> +
> +	/* Fast (1st) path precalcs shfr before ext loaded from mem   */
> +	/* Dictionary lookup slow (2nd) path is only 1% of codepoints */
> +	fast = (type != 2 | ext & 1);
> +	if (fast) {
> +		shfr = (shfr + 5) & -(type >= 2);
> +		shfr = (shfr - (6 & -(type == 2)) + type);
> +		return (ext << 8 | 0xfe) >> shfr & 1;
> +	} else {
> +		return dict[ext >> 1 & 0x7f] >> shfr & 1;
> +	}
>  }
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.