musl - Re: Efficient case mapping

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180307034852.GL1436@brightrain.aerifal.cx>
Date: Tue, 6 Mar 2018 22:48:52 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Efficient case mapping

On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote:
> Further optimizing: We're not using all the data we already have. Only
> alphabetic characters can have case mappings, and we already have an
> iswalpha table; the above tables are highly correlated with it.
> 
> If we short-circuit out the whole operation with iswalpha, then the
> "000 - no case mappings" value is no longer common. Instead, "zero is
> one of the common deltas" can be included or not on a per-block basis.
> 
> We can also add, along with the per-block common deltas, a table of
> which cases each delta goes with, and assign new bit meanings:
> 
> 00 - is case[0] for block, map to ~case[0] via delta[0]
> 01 - is case[1] for block, map to ~case[1] via delta[1]
> 10 - is case[2] for block, map to ~case[2] via delta[2]
> 11 - is exception, map via binary search of exceptions table
> 
> If needed we could add back a third bit here, but I think we can get
> by without it.
> 
> Note that, because iswalpha is now short-circuiting non-alphabetic
> characters, we can just assign an arbitrary table value (best: 00) to
> all non-alphabetic (including unassigned) characters. This drops most
> of the "simple blocks" (blocks with just 0 or a single delta) down to
> trivial (elided) blocks.
> 
> These last ideas ("further optimizing:" and beyond) were worked out
> while writing this email, and I don't yet have any analysis/test code
> to go with them. Will follow up later with results. With some luck,
> final version will be barely-larger than current code, extensible, and
> fast. :)

I now have results for this approach which I'm attaching. The output
is from my block analysis tool. For each block, it displays case
mapping rules sorted by decreasing frequency, with characters that
don't matter (because they're non-alphabetic) merged into the most
frequent rule. The syntax is:

  rule d(t) [f]

where t is the character type (n = no case, u = uppercase, l =
lowercase, x = exceptional/titlecase), d is the delta to the opposite
case character, and f is the frequency of the rule (again, counting
merging of dontcare into the most frequent).

While the vast majority of blocks are fine with 0-2 bits of rule data,
a few do have more than 3 fairly-frequent rules, the worst being
blocks 04, 05, and 2C. Having an extra bit would save at least 169
exceptions in those, and thus ~676 bytes. The fixed cost of an extra
table bit is 512 bytes (for the first level), plus 32 bytes per block
that contains data, so it's pretty close to break-even, but keeping
the exception table small does help performance too. My leaning would
be to add the extra bit, but write a cost function so that it's only
used where the 4th-7th frequencies for the block total more than 8
(the number of exceptions to break even at 32 bytes).

With that approach it looks like we have about 220 entries in the
exception tables, 4-5 blocks with a third bit, 21 blocks with a second
bit, 24 blocks with a first bit, and 3 table headers. That comes to
about 880+160+672+768+1536 = 4k.

One thing I haven't accounted for is storing the 3 (or 7) deltas per
block. The raw deltas are signed 17-bit, but if we assume mappings
don't cross planes they can be thought of as unsigned 16-bit on the
low 16 bits of the codepoint. But that's still pretty large -- 3
16-bit numbers by 512 blocks is 3k. Fortunately I think we can do a
lot better. First, have a table of all the deltas that ever appear
(about 20). Then, for each block, have 0-7 single-byte indices into
this table, and a single-byte index to the sequence of indices. Now we
just have about 2*20+512+5*7+16*3+3*1 = 638 bytes.

4.6k is a good bit more than I was planning on spending here, so maybe
there are still improvements to be made...

Rich

View attachment "caseblocks1.txt" of type "text/plain" (4406 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.