Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260120150911.GR1827@brightrain.aerifal.cx>
Date: Tue, 20 Jan 2026 10:09:11 -0500
From: Rich Felker <dalias@...c.org>
To: Xan Phung <xan.phung@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: [PATCH v1] wctype: reduce text size of iswalpha & iswpunct
 by 53%

On Tue, Jan 20, 2026 at 01:35:26AM +1100, Xan Phung wrote:
> Currently iswalpha and iswpunct have total text data size of over
> 8kb. A more efficient encoding has reduced the total size to 3.75kb
> (2kb and 1.8kb respectively), a 53% reduction.
> 
> The new encoding remains a (mostly) branchless table lookup, but now
> requires 3 memory accesses instead of 2. It remains optimized for
> random access decoding. The top level remains much the same,
> providing 8 bit offsets into codepage units (256 codepoint
> 
> granularity). The second level data uses fixed sizes, of one 32 bit
> word per codepage (where each 2 bit pair in word identifies a block
> of 16 codepoints as all 0, all 1, or mixed). The third level is a
> variable length series of extension bytes, indexed by the popcount
> of set high bits within the second level's 32 bit word. This
> popcount is calculated with nearly same latency as a 32 bit multiply
> (so it is comparable with the indexing speed of accessing a 2D array
> of non power of 2 size).
> 
> Results have been tested against first 0x20000 codepoints, and match
> that returned by the pre-existing musl implementation.

How are the tables generated? Do you have a patch to or replacement
for the musl-chartable-tools code that generates the existing tables?
It needs to be possible to update them easily for new Unicode
additions, and we need to be confident that the format can efficiently
represent any likely future additions.

> diff --git a/src/ctype/iswalpha.c b/src/ctype/iswalpha.c
> index 1c5485d..e3b7037 100644
> --- a/src/ctype/iswalpha.c
> +++ b/src/ctype/iswalpha.c
> @@ -1,16 +1,54 @@
>  #include <wctype.h>
>  
> -static const unsigned char table[] = {
> -#include "alpha.h"
> +#define PAGE_SH   8
> +#define PAGE_MAX  (1u << PAGE_SH)
> +#define PAGES     (0x20000 / PAGE_MAX)
> +#define PAGEH     (0x200/8 + PAGES)    /* 0x200 direct mapped codepts */
> +
> +const static unsigned char table[PAGEH + 227*sizeof(unsigned)] = {
> +#include "iswalpha_table.h"
>  };
>  
> -int iswalpha(wint_t wc)
> -{
> -	if (wc<0x20000U)
> -		return (table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1;
> -	if (wc<0x2fffeU)
> -		return 1;
> -	return 0;
> +const static unsigned short dict[119] = {
> +#include "iswalpha_dict.h"
> +};
> +
> +int iswalpha(wint_t wc) {
> +	unsigned *huffm = (unsigned *)(table + PAGEH), target;

The aliasing violation here isn't acceptable. Is there any good way to
avoid it? I've only skimmed this submission so far; I assume the
motivation is doing the popcount across 4 bytes, but I don't yet
understand why popcount was chosen as the way of storing an index.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.