Message-ID: <20250529133013.GX1827@brightrain.aerifal.cx>
Date: Thu, 29 May 2025 09:30:13 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Thu, May 29, 2025 at 11:45:01AM +0200, Nick Wellnhofer wrote:
> On May 29, 2025, at 04:37, Rich Felker <dalias@...c.org> wrote:
> > Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> > per entry, for 512 bytes.
> > 
> > Second-level tables (indexed by codepoint&255):
> 
> You could also try different bit shifts that might yield smaller
> tables. Another option to compress the data further is to use
> third-level tables. I have some old code somewhere that brute-forces
> all combinations of shift values to find the smallest tables.

Indeed, that's certainly a reasonable option to consider. My original
balancing between the top- and second-level sizes was based on the
1-bit tables for isw*(), where the top-level has much higher weight
and you want to keep it small. But here, where the second-level has
more weight, it may make sense to use much smaller blocks. I'll have a
look at the distribution and see if this looks like it would be better
or worse than the cheap range-limiting shortcut I mentioned before.
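
For illustration, the shift search suggested above could look roughly
like the following. This is only a minimal sketch under stated
assumptions, not code from musl or from Nick's tool: the names
two_level_size() and best_shift() are hypothetical, the input is
assumed to be a flat array with one byte of data per codepoint,
identical second-level blocks are assumed to be stored once, and
top-level entries are assumed to stay 1 byte each (so at most 256
distinct blocks are allowed).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: total bytes for a two-level table over
 * data[0..n-1] when the low `shift` bits index within a block.
 * Assumes 1-byte top-level entries and deduplicated blocks.
 * Returns SIZE_MAX if more than 256 distinct blocks are needed,
 * since a 1-byte top-level entry could not index them.
 * A lookup would then be: blocks[top[c >> shift]][c & (bs-1)]. */
size_t two_level_size(const unsigned char *data, size_t n, unsigned shift)
{
	size_t bs = (size_t)1 << shift;   /* block size in entries */
	size_t nblocks = n >> shift;      /* number of top-level entries */
	size_t uniq[256];                 /* offsets of distinct blocks seen */
	size_t nuniq = 0;

	assert(n % bs == 0);
	for (size_t b = 0; b < nblocks; b++) {
		const unsigned char *blk = data + b * bs;
		size_t j;
		/* Linear scan for an identical block already recorded. */
		for (j = 0; j < nuniq; j++)
			if (!memcmp(blk, data + uniq[j], bs))
				break;
		if (j == nuniq) {
			if (nuniq == 256)
				return SIZE_MAX;
			uniq[nuniq++] = b * bs;
		}
	}
	return nblocks + nuniq * bs;
}

/* Hypothetical helper: try every shift in 0..max_shift that evenly
 * divides n and return the one giving the smallest total size.
 * Ties go to the smaller shift. *best_size receives that size, or
 * SIZE_MAX if no shift fits within 256 distinct blocks. */
unsigned best_shift(const unsigned char *data, size_t n,
                    unsigned max_shift, size_t *best_size)
{
	unsigned best = 0;
	size_t best_sz = SIZE_MAX;

	for (unsigned s = 0; s <= max_shift; s++) {
		if (n % ((size_t)1 << s))
			break;
		size_t sz = two_level_size(data, n, s);
		if (sz < best_sz) {
			best_sz = sz;
			best = s;
		}
	}
	*best_size = best_sz;
	return best;
}
```

With second-level-heavy data, the total is dominated by the
nuniq * bs term, which is why smaller blocks can win there even
though they enlarge the top level. Extending this to third-level
tables would mean applying the same deduplication to the top-level
array itself.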

Thanks for reading and offering up ideas!

Rich
