Date: Tue, 6 Mar 2018 22:48:52 -0500 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Re: Efficient case mapping On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote: > Further optimizing: We're not using all the data we already have. Only > alphabetic characters can have case mappings, and we already have an > iswalpha table; the above tables are highly correlated with it. > > If we short-circuit out the whole operation with iswalpha, then the > "000 - no case mappings" value is no longer common. Instead, "zero is > one of the common deltas" can be included or not on a per-block basis. > > We can also add, along with the per-block common deltas, a table of > which cases each delta goes with, and assign new bit meanings: > > 00 - is case for block, map to ~case via delta > 01 - is case for block, map to ~case via delta > 10 - is case for block, map to ~case via delta > 11 - is exception, map via binary search of exceptions table > > If needed we could add back a third bit here, but I think we can get > by without it. > > Note that, because iswalpha is now short-circuiting non-alphabetic > characters, we can just assign an arbitrary table value (best: 00) to > all non-alphabetic (including unassigned) characters. This drops most > of the "simple blocks" (blocks with just 0 or a single delta) down to > trivial (elided) blocks. > > These last ideas ("further optimizing:" and beyond) were worked out > while writing this email, and I don't yet have any analysis/test code > to go with them. Will follow up later with results. With some luck, > final version will be barely-larger than current code, extensible, and > fast. :) I now have results for this approach which I'm attaching. The output is from my block analysis tool. For each block, it displays case mapping rules sorted by decreasing frequency, with characters that don't matter (because they're non-alphabetic) merged into the most frequent rule. The syntax is: rule d(t) [f] where t is the character type (n = no case, u = uppercase, l = lowercase, x = exceptional/titlecase), d is the delta to the opposite case character, and f is the frequency of the rule (again, counting merging of dontcare into the most frequent). While the vast majority of blocks are fine with 0-2 bits of rule data, a few do have more than 3 fairly-frequent rules, the worst being blocks 04, 05, and 2C. Having an extra bit would save at least 169 exceptions in those, and thus ~676 bytes. The fixed cost of an extra table bit is 512 bytes (for the first level), plus 32 bytes per block that contains data, so it's pretty close to break-even, but keeping the exception table small does help performance too. My leaning would be to add the extra bit, but write a cost function so that it's only used where the 4th-7th frequencies for the block total more than 8 (the number of exceptions to break even at 32 bytes). With that approach it looks like we have about 220 entries in the exception tables, 4-5 blocks with a third bit, 21 blocks with a second bit, 24 blocks with a first bit, and 3 table headers. That comes to about 880+160+672+768+1536 = 4k. One thing I haven't accounted for is storing the 3 (or 7) deltas per block. The raw deltas are signed 17-bit, but if we assume mappings don't cross planes they can be thought of as unsigned 16-bit on the low 16 bits of the codepoint. But that's still pretty large -- 3 16-bit numbers by 512 blocks is 3k. Fortunately I think we can do a lot better. First, have a table of all the deltas that ever appear (about 20). Then, for each block, have 0-7 single-byte indices into this table, and a single-byte index to the sequence of indices. Now we just have about 2*20+512+5*7+16*3+3*1 = 638 bytes. 4.6k is a good bit more than I was planning on spending here, so maybe there are still improvements to be made... Rich View attachment "caseblocks1.txt" of type "text/plain" (4406 bytes)
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.