Message-ID: <20250529140357.GY1827@brightrain.aerifal.cx>
Date: Thu, 29 May 2025 10:03:57 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Thu, May 29, 2025 at 09:30:13AM -0400, Rich Felker wrote:
> On Thu, May 29, 2025 at 11:45:01AM +0200, Nick Wellnhofer wrote:
> > On May 29, 2025, at 04:37, Rich Felker <dalias@...c.org> wrote:
> > > Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> > > per entry, for 512 bytes.
> > >
> > > Second-level tables (indexed by codepoint&255):
> >
> > You could also try different bit shifts that might yield smaller
> > tables. Another option to compress the data further is to use
> > third-level tables. I have some old code somewhere that brute-forces
> > all combinations of shift values to find the smallest tables.
>
> Indeed, that's certainly a reasonable option to consider. My original
> balancing between the top- and second-level sizes was based on the
> 1-bit tables for isw*(), where the top-level has much higher weight
> and you want to keep it small. But here, where the second-level has
> more weight, it may make sense to use much smaller blocks. I'll have a
> look at the distribution and see if this looks like it would be better
> or worse than the cheap range-limiting shortcut I mentioned before.
>
> Thanks for reading and offering up ideas!

Via some simple grep, sed, cut, and uniq:

grep -E '^([^;]*;){3}[^0].*' UnicodeData.txt | cut -d';' -f1 |
sed -e 's/[8-9A-F].$/80/' -e 's/[0-7].$/00/' | uniq -c

there are only 101 128-codepoint blocks vs 68 256-codepoint blocks.
This is just looking at the non-starters, since the decomposing
characters are much less of a burden already.

Extending that further, there are only 135 64-codepoint blocks. But 84
of those have just 1-3 codepoints in them, the rest still being dead
space. This suggests to me that the range check and elision of all but
the used range of each block is going to have *much* bigger returns.
Still, it's possible to combine both, and that might turn out to be
really effective vs just doing the range checks.

In fact, that looks like it might be the case. Looking at my output up
through 0FFF, just the 64-codepoint blocks with 1-3 entries:

      3 0840
      1 0900
      1 0980
      2 09C0
      1 0A00
      1 0A40
      1 0A80
      1 0AC0
      1 0B00
      1 0B40
      1 0BC0
      1 0C00
      3 0C40
      1 0C80
      1 0CC0
      2 0D00
      1 0D40
      1 0DC0
      3 0E00
      3 0E80
      1 0FC0

we see that almost all of the 256-codepoint blocks have sporadic
entries that would require a range of at least 64 (and often 128-192)
to be mapped, but become range-1 blocks if split down to 64-codepoint
granularity.

The degree to which these are sporadic suggests an even more optimal
solution, which I don't think I want to pursue, but it should be
mentioned: with a 1-bit table to exclude codepoints that are not
subject to normalization processing, I think one could compute a
perfect hash function to map all of them to a flat array with
relatively few unused slots. This would require the perfect hash
building logic to go into musl-chartable-tools. In this solution, the
expected total table size would be roughly 14-15k.

That's not really a whole lot smaller than what I expect we can get
the multi-level table down to. If you take out the actual expansions,
which are 9580 bytes, it's a better comparison. Those could be packed
tighter to get the total down further, but improving them is
independent of the way we map to them.
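To make the layout concrete before estimating its size, here is a
minimal sketch of what the 64-codepoint-block lookup with per-block
range elision could look like. All the names, the header format, and
the toy table contents here are hypothetical; the real arrays would be
generated by musl-chartable-tools:

#include <stdint.h>

/* Hypothetical generated tables (contents made up for illustration).
 * One top-level byte per 64-codepoint block over U+0000..U+1FFFF
 * selects a second-level block: 0x20000/64 = 2048 bytes. */
static const uint8_t top[0x20000>>6] = { [0x0300>>6] = 1 }; /* toy */

/* Byte offset of each block within blocks[]. */
static const uint16_t block_off[] = { 0, 2 };

/* Each block is a 2-byte header (first and last used offset within
 * the block) followed by one property byte per codepoint in that
 * range. A dense block costs 2+64 = 66 bytes; a block with 1-3
 * sporadic codepoints costs ~10 or fewer. Block 0 is a shared empty
 * block whose header has first > last, so it rejects everything. */
static const uint8_t blocks[] = {
	1, 0,              /* block 0: empty */
	0, 1, 230, 230,    /* block 1 (toy): ccc 230 for U+0300-0301 */
};

static int norm_prop(uint32_t cp)
{
	const uint8_t *b;
	uint32_t k = cp & 63;
	if (cp >= 0x20000) return 0; /* cheap range cutoff, as above */
	b = blocks + block_off[top[cp>>6]];
	if (k < b[0] || k > b[1]) return 0;
	return b[2 + k - b[0]];
}

With that layout a lookup is two loads plus a range check, and the
size estimate below falls out directly from the header-plus-used-range
encoding.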
OK, let's estimate the 64-codepoint-block multilevel table size.

There are 51 nontrivial blocks (more than 1-3 codepoints), so let's
estimate them all at the full 66-byte size, and the remaining 84 at 10
bytes each. That's 84*10 + 51*66 = 4206 bytes, plus the 2048-byte
top-level table, for 6254 bytes total: basically the same as the
perfect hash with 1-bit rejection table. (Note: the 1-bit rejection
table can't really be optimized much by excluding ranges, because
we're down to the level of reasonable storage granularity.)

So I think this is looking really good, and about half of the 30k we
were looking at before.

Rich