Message-ID: <20250529133013.GX1827@brightrain.aerifal.cx>
Date: Thu, 29 May 2025 09:30:13 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Thu, May 29, 2025 at 11:45:01AM +0200, Nick Wellnhofer wrote:
> On May 29, 2025, at 04:37, Rich Felker <dalias@...c.org> wrote:
> > Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> > per entry, for 512 bytes.
> >
> > Second-level tables (indexed by codepoint&255):
>
> You could also try different bit shifts that might yield smaller
> tables. Another option to compress the data further is to use
> third-level tables. I have some old code somewhere that brute-forces
> all combinations of shift values to find the smallest tables.

Indeed, that's certainly a reasonable option to consider. My original
balancing between the top- and second-level sizes was based on the
1-bit tables for isw*(), where the top-level has much higher weight
and you want to keep it small. But here, where the second-level has
more weight, it may make sense to use much smaller blocks.

I'll have a look at the distribution and see if this looks like it
would be better or worse than the cheap range-limiting shortcut I
mentioned before.

Thanks for reading and offering up ideas!

Rich
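
To make the shift trade-off discussed above concrete, here is a minimal sketch of a brute-force search over block sizes for a two-level table. It is not code from musl or from either author. The function names (`build_tables`, `lookup`, `table_bytes`, `best_shift`), the 0x20000-codepoint range, the sample property values, and the rule of widening top-level entries to 2 bytes once there are more than 256 distinct blocks are all assumptions made for illustration. Second-level entries are assumed to be 1 byte each.

```python
def build_tables(values, shift):
    """Split values into blocks of 1 << shift entries and deduplicate them.

    Returns (top, blocks): top[cp >> shift] is an index into blocks, and
    blocks[i][cp & mask] is the stored value for codepoint cp.
    """
    block_size = 1 << shift
    top = []      # one block index per top-level slot
    blocks = []   # unique second-level blocks, in first-seen order
    seen = {}     # block contents -> index into blocks
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        # Pad a short final block so every block has the same length.
        block += (0,) * (block_size - len(block))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        top.append(seen[block])
    return top, blocks


def lookup(top, blocks, shift, cp):
    """Two-level lookup: the high bits select a block, the low bits an entry."""
    return blocks[top[cp >> shift]][cp & ((1 << shift) - 1)]


def table_bytes(top, blocks, shift):
    """Total size in bytes, assuming 1-byte second-level entries.

    Assumption: top-level entries need 2 bytes once there are more than
    256 distinct blocks, since a 1-byte index can no longer address them.
    """
    top_width = 1 if len(blocks) <= 256 else 2
    return len(top) * top_width + len(blocks) * (1 << shift)


def best_shift(values, shifts=range(1, 13)):
    """Try every candidate shift and return (smallest_shift, sizes_by_shift)."""
    sizes = {}
    for shift in shifts:
        top, blocks = build_tables(values, shift)
        sizes[shift] = table_bytes(top, blocks, shift)
    return min(sizes, key=sizes.get), sizes


# Hypothetical sparse property data: mostly zero, with one dense run and
# one isolated entry. Real data would be generated from the Unicode tables.
sample = [0] * 0x20000
for cp in range(0x300, 0x370):
    sample[cp] = 230
sample[0x1D165] = 216

shift, sizes = best_shift(sample)
print("shift 8:", sizes[8], "bytes; best shift", shift, "->", sizes[shift], "bytes")
```

With shift 8 this reproduces the layout quoted above: 0x20000 >> 8 gives 512 top-level entries of 1 byte each. Smaller shifts grow the top level but shrink each block, which tends to pay off when the second level carries most of the weight. Extending the search to third-level tables would mean iterating over pairs of shifts rather than a single one.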