Message-ID: <20250529140357.GY1827@brightrain.aerifal.cx>
Date: Thu, 29 May 2025 10:03:57 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Collation, IDN, and Unicode normalization

On Thu, May 29, 2025 at 09:30:13AM -0400, Rich Felker wrote:
> On Thu, May 29, 2025 at 11:45:01AM +0200, Nick Wellnhofer wrote:
> > On May 29, 2025, at 04:37, Rich Felker <dalias@...c.org> wrote:
> > > Top-level table (indexed by codepoint>>8) to select a table: 1 byte
> > > per entry, for 512 bytes.
> > > 
> > > Second-level tables (indexed by codepoint&255):
> > 
> > You could also try different bit shifts that might yield smaller
> > tables. Another option to compress the data further is to use
> > third-level tables. I have some old code somewhere that brute-forces
> > all combinations of shift values to find the smallest tables.
> 
> Indeed, that's certainly a reasonable option to consider. My original
> balancing between the top- and second-level sizes was based on the
> 1-bit tables for isw*(), where the top-level has much higher weight
> and you want to keep it small. But here, where the second-level has
> more weight, it may make sense to use much smaller blocks. I'll have a
> look at the distribution and see if this looks like it would be better
> or worse than the cheap range-limiting shortcut I mentioned before.
> 
> Thanks for reading and offering up ideas!

Via some simple grep, sed, cut, and uniq, rounding each matching codepoint down to a 128-codepoint boundary and counting the distinct blocks:

grep -E '^([^;]*;){3}[^0].*' UnicodeData.txt | cut -d';' -f1 | sed -e 's/[8-9A-F].$/80/' -e 's/[0-7].$/00/' | uniq -c

there are only 101 occupied 128-codepoint blocks vs. 68 occupied 256-codepoint blocks.

This is just looking at the non-starters, since the decomposing
characters are much less of a burden already.

Extending that further, there are only 135 occupied 64-codepoint blocks. But 84 of those have just 1-3 codepoints in them, with the rest of each block still being dead space. This suggests to me that the range check, eliding all but the used range of each block, is going to have *much* bigger returns.

Still, it's possible to combine both, and that might turn out to be really effective vs. just doing the range checks. In fact, that looks like it might be the case. Looking at my output up through 0FFF, just the 64-codepoint blocks with 1-3 entries:

3 0840
1 0900
1 0980
2 09C0
1 0A00
1 0A40
1 0A80
1 0AC0
1 0B00
1 0B40
1 0BC0
1 0C00
3 0C40
1 0C80
1 0CC0
2 0D00
1 0D40
1 0DC0
3 0E00
3 0E80
1 0FC0

we see that almost all of the 256-codepoint blocks have sporadic entries that would require a range of at least 64 (and often 128-192) to be mapped, but they become range-1 blocks when split down to 64-codepoint granularity.
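
To make that concrete, here's a rough sketch (made-up names, not actual musl code) of what a lookup over 64-codepoint blocks with per-block range elision could look like, assuming one byte of data per codepoint:

struct block {
    unsigned char first, last;  /* used subrange within the block */
    const unsigned char *data;  /* last-first+1 bytes */
};

static unsigned char cp_data(unsigned cp, const unsigned char *top,
                             const struct block *blocks)
{
    if (cp >= 0x20000) return 0;  /* outside the covered range */
    const struct block *b = &blocks[top[cp>>6]];
    unsigned i = cp & 63;
    if (i < b->first || i > b->last) return 0;  /* elided dead space */
    return b->data[i - b->first];
}

A real table would presumably store offsets into one flat byte array instead of pointers, but the shape of the lookup is the same.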

The degree to which these are sporadic suggests an even more optimal solution, which I don't think I want to pursue, but it should be mentioned: with a 1-bit table to exclude codepoints that are not subject to normalization processing, I think one could compute a perfect hash function to map all of them to a flat array with relatively few unused slots. This would require the perfect-hash-building logic to go into musl-chartable-tools. In this solution, the expected total table size would be roughly 14-15k. That's not really a whole lot smaller than what I expect we can get the multi-level table down to. Taking out the actual expansions, which are 9580 bytes, gives a better comparison of just the mapping structures. Those could be packed tighter to get the size down further, but improving them is independent of the way we map to them.
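
For completeness, the lookup in that scheme would be roughly (again just a sketch; phf() stands in for whatever perfect hash function the generator would emit):

extern const unsigned char reject[];  /* 1 bit per codepoint */
extern const unsigned char slots[];   /* flat array, few unused slots */
unsigned phf(unsigned);               /* generated perfect hash */

static unsigned char norm_data(unsigned cp)
{
    if (!(reject[cp>>3] >> (cp&7) & 1))
        return 0;  /* not subject to normalization processing */
    return slots[phf(cp)];
}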

OK, let's estimate the 64-codepoint-block multilevel table size. There are 51 nontrivial blocks (more than 3 codepoints), so let's estimate them all at the full 66-byte size, and the rest (84) at 10 bytes each. That's 84*10 + 51*66 = 4206 bytes. Plus the 2048-byte top-level table, that's 6254 bytes, basically the same ballpark as the perfect hash with 1-bit rejection table. (Note: the 1-bit rejection table can't really be optimized much by excluding ranges, because we're down to the level of reasonable storage granularity.) So I think this is looking really good: with the 9580 bytes of expansions added back, it's about half of the 30k we were looking at before.
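
That is:

   84 sparse blocks * 10 bytes =  840
   51 full blocks * 66 bytes   = 3366
   top-level table             = 2048
   total                       = 6254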

Rich
