[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
```Date: Thu, 5 Apr 2018 13:09:50 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Efficient case mapping

On Wed, Mar 07, 2018 at 01:25:36PM -0500, Rich Felker wrote:
> On Tue, Mar 06, 2018 at 10:48:52PM -0500, Rich Felker wrote:
> > On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote:
> > > Further optimizing: We're not using all the data we already have. Only
> > > alphabetic characters can have case mappings, and we already have an
> > > iswalpha table; the above tables are highly correlated with it.
> > >
> > > If we short-circuit out the whole operation with iswalpha, then the
> > > "000 - no case mappings" value is no longer common. Instead, "zero is
> > > one of the common deltas" can be included or not on a per-block basis.
> > >
> > > We can also add, along with the per-block common deltas, a table of
> > > which cases each delta goes with, and assign new bit meanings:
> > >
> > > 00 - is case[0] for block, map to ~case[0] via delta[0]
> > > 01 - is case[1] for block, map to ~case[1] via delta[1]
> > > 10 - is case[2] for block, map to ~case[2] via delta[2]
> > > 11 - is exception, map via binary search of exceptions table
> > >
> > > If needed we could add back a third bit here, but I think we can get
> > > by without it.
> > >
> > > Note that, because iswalpha is now short-circuiting non-alphabetic
> > > characters, we can just assign an arbitrary table value (best: 00) to
> > > all non-alphabetic (including unassigned) characters. This drops most
> > > of the "simple blocks" (blocks with just 0 or a single delta) down to
> > > trivial (elided) blocks.
> > >
> > > These last ideas ("further optimizing:" and beyond) were worked out
> > > while writing this email, and I don't yet have any analysis/test code
> > > to go with them. Will follow up later with results. With some luck,
> > > final version will be barely-larger than current code, extensible, and
> > > fast. :)
> >
> > I now have results for this approach which I'm attaching. The output
> > is from my block analysis tool. For each block, it displays case
> > mapping rules sorted by decreasing frequency, with characters that
> > don't matter (because they're non-alphabetic) merged into the most
> > frequent rule. The syntax is:
> >
> >   rule d(t) [f]
> >
> > where t is the character type (n = no case, u = uppercase, l =
> > lowercase, x = exceptional/titlecase), d is the delta to the opposite
> > case character, and f is the frequency of the rule (again, counting
> > merging of dontcare into the most frequent).
> >
> > While the vast majority of blocks are fine with 0-2 bits of rule data,
> > a few do have more than 3 fairly-frequent rules, the worst being
> > blocks 04, 05, and 2C. Having an extra bit would save at least 169
> > exceptions in those, and thus ~676 bytes. The fixed cost of an extra
> > table bit is 512 bytes (for the first level), plus 32 bytes per block
> > that contains data, so it's pretty close to break-even, but keeping
> > the exception table small does help performance too. My leaning would
> > be to add the extra bit, but write a cost function so that it's only
> > used where the 4th-7th frequencies for the block total more than 8
> > (the number of exceptions to break even at 32 bytes).
> >
> > With that approach it looks like we have about 220 entries in the
> > exception tables, 4-5 blocks with a third bit, 21 blocks with a second
> > bit, 24 blocks with a first bit, and 3 table headers. That comes to
> > about 880+160+672+768+1536 = 4k.
> >
> > One thing I haven't accounted for is storing the 3 (or 7) deltas per
> > block. The raw deltas are signed 17-bit, but if we assume mappings
> > don't cross planes they can be thought of as unsigned 16-bit on the
> > low 16 bits of the codepoint. But that's still pretty large -- 3
> > 16-bit numbers by 512 blocks is 3k. Fortunately I think we can do a
> > lot better. First, have a table of all the deltas that ever appear
> > (about 20). Then, for each block, have 0-7 single-byte indices into
> > this table, and a single-byte index to the sequence of indices. Now we
> > just have about 2*20+512+5*7+16*3+3*1 = 638 bytes.
> >
> > 4.6k is a good bit more than I was planning on spending here, so maybe
> > there are still improvements to be made...
>
> And here is a draft of the code that would use the mapping. The
> #ifdef'd external tables are to allow compiling the code without
> actually having table data yet; otherwise gcc optimizes out the whole
> translation unit. :-)

Updated with table output, usable.

It comes out to about 5.5k of rodata+text, compared to about 1.5k for
the awful, slow, manually-maintained code we have now.

In terms of performance, it's about 10x faster on average across all
characters with case mappings (iterating and mapping them each), and
also 10x faster iterating over all characters and mapping them,
presumably since the cost of mapping most non-casing characters is
cheap either way.

In actual numbers, on my S1260, it averages about 50ns for each case
mapping with the new code, and about 500ns with the old. Replacing the
case mapping functions with the identity function (still an external
call), the overhead seems to be 7-10ns per call, so the difference is
more like 40ns vs 490 ns.

At this point I'm fairly happy with the results except that the size
and logical complexit are somewhat higher than I had hoped for. I
wonder if it might be comparable (maybe even smaller) size to do away
with the variable sized (per-block) table entries and just always have
2-bit or 3-bit entries. It would certainly make the code simpler and
faster. I can probably work out the delta-size without actually
writing any code.

BTW this work seems to have turned up one omission in musl's current
case mappings: U+037F/U+03F3. Any ideas if there's a reason it was
omitted, or if this is just a mistake?

Rich

View attachment "casemap.c" of type "text/plain" (1550 bytes)

View attachment "casemap.h" of type "text/plain" (19098 bytes)
```