musl - Re: Collation, IDN, and Unicode normalization

Message-ID: <20250529133013.GX1827@brightrain.aerifal.cx>
Date: Thu, 29 May 2025 09:30:13 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Thu, May 29, 2025 at 11:45:01AM +0200, Nick Wellnhofer wrote:
> On May 29, 2025, at 04:37, Rich Felker <dalias@...c.org> wrote:
> > Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> > per entry, for 512 bytes.
> > 
> > Second-level tables (indexed by codepoint&255):
> 
> You could also try different bit shifts that might yield smaller
> tables. Another option to compress the data further is to use
> third-level tables. I have some old code somewhere that brute-forces
> all combinations of shift values to find the smallest tables.

Indeed, that's certainly a reasonable option to consider. My original
balancing between the top- and second-level sizes was based on the
1-bit tables for isw*(), where the top-level has much higher weight
and you want to keep it small. But here, where the second-level has
more weight, it may make sense to use much smaller blocks. I'll have a
look at the distribution and see if this looks like it would be better
or worse than the cheap range-limiting shortcut I mentioned before.
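
For illustration, the shift search suggested above could be sketched roughly
as follows. This is only a sketch under assumptions that are not from the
thread: one byte per second-level entry, a one-byte top-level index (so at
most 256 distinct blocks), and hypothetical helper names
(count_unique_blocks, two_level_size, best_shift). It is not code from musl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count the distinct blocks of (1 << shift) bytes
 * in data[0..n). Quadratic in the number of blocks, which is acceptable
 * for an offline table generator. */
static size_t count_unique_blocks(const uint8_t *data, size_t n, unsigned shift)
{
    size_t bs = (size_t)1 << shift, nblocks = n >> shift, unique = 0;
    for (size_t i = 0; i < nblocks; i++) {
        size_t j;
        for (j = 0; j < i; j++)
            if (!memcmp(data + i * bs, data + j * bs, bs))
                break;
        if (j == i)          /* no earlier block matched: new unique block */
            unique++;
    }
    return unique;
}

/* Total bytes for a two-level table with the given shift, assuming
 * 1 byte per top-level entry and 1 byte per second-level entry.
 * Returns (size_t)-1 if a 1-byte top-level index cannot address
 * all of the unique blocks. */
static size_t two_level_size(const uint8_t *data, size_t n, unsigned shift)
{
    assert(shift < 32 && n % ((size_t)1 << shift) == 0);
    size_t unique = count_unique_blocks(data, n, shift);
    if (unique > 256)
        return (size_t)-1;
    return (n >> shift) + (unique << shift);
}

/* Try every shift in [lo, hi] and return the one giving the smallest
 * total size; on a tie the smallest shift wins. */
static unsigned best_shift(const uint8_t *data, size_t n,
                           unsigned lo, unsigned hi, size_t *out_size)
{
    unsigned best = lo;
    size_t best_size = (size_t)-1;
    for (unsigned s = lo; s <= hi; s++) {
        size_t sz = two_level_size(data, n, s);
        if (sz < best_size) {
            best_size = sz;
            best = s;
        }
    }
    *out_size = best_size;
    return best;
}
```

With tables like these, a lookup for codepoint c would be along the lines of
second[top[c >> shift]][c & ((1 << shift) - 1)]. When the second-level data
dominates, smaller blocks tend to deduplicate better at the cost of a larger
top-level table, which is the tradeoff the search is meant to measure.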

Thanks for reading and offering up ideas!

Rich
