musl - Re: Collation weight length frequencies

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20260618215256.GN27423@brightrain.aerifal.cx>
Date: Thu, 18 Jun 2026 17:52:56 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation weight length frequencies

On Sat, May 23, 2026 at 05:02:18PM -0400, Rich Felker wrote:
> However, what it also suggests is maybe having a data-defined
> dictionary of header byte values. which would avoid locking in any
> assumptions about the FractionalUCA.txt implementation of root data
> and instead allow any assignment (e.g. not even necessarily
> variable-length/fractional) of weights as non-null byte sequences
> compatible with strxfrm.
> 
> From an immediate practical standpoint, this would facilitate eliding
> not just one common secondary/tertiary byte value (05), but basically
> all of the common secondary/tertiary weight bytes. So that, instead of
> most entries in the table being one of (4-6 bytes):
> 
> - hh pp pp pp ss tt
> - hh pp pp ss tt
> - hh pp ss tt
> 
> most would be (2-4 bytes):
> 
> - hh pp pp pp
> - hh pp pp
> - hh pp
> 
> We could probably take this even further and let header byte represent
> a lead primary byte too. This would drop the 87k ideographic collation
> elements from 4 bytes each to 3 bytes each.

I don't really see a lot of advantage in special-casing

    "header dictionary defines the weights for a given level entirely
    or just defines the length in bytes and the weight bytes
    themselves are in the rules"

versus just doing:

    "header dictionary defines the total length and a shared prefix of
    the weight bytes".

It's possible that the former could be slightly faster to process, but
I don't think the differences will be significant.

If we go with the latter approach, the information content of header
dictionary entries is essentially:

    l1 l2 l3 p1 p2 p3

Where l1..3 are lengths (1 byte each is plenty) and p1..p3 are each
null-terminated byte sequences making up prefixes.

So a header dictionary entry for collation elements which are 1-byte
primary weight with default secondary and tertiary weight would look
like (05 being default 2-/3-ary weight):

    01 01 01 00 05 00 05 00

if the lengths are totals including the prefixes, or:

    01 00 00 00 05 00 05 00

if they just represent the number of additional bytes beyond the
prefix.

However, this form isn't very efficient to process. There's a lot of
scanning for null terminators to find lengths and where the next field
starts, and this is needs to be done in passes even where the data
won't be used -- for example, the first pass of strxfrm needs to
measure the total lengths of the secondary and tertiary weights to
check whether to fail for insufficient space and to prep for emitting
secondary weights in reverse order in locales where that's the needed
behavior.

Since there are at most 254 header dictionary entries, there's no
reason to make them tightly packed. It's better to just focus on
having them efficient to access/process. My leaning is something like
(all numeric fields single unsigned bytes):

Opening header:
- total length for weights at all levels
- total length explicitly stored in mapping (not implicit in header),
  aka offset to next collation element in the mapping

For each level:
- offset to explicit tail
- length of explicit tail
- offset to common prefix bytes
- length of common prefix bytes

For the above example, this would yield:

    03 01 00 01 00 00 00 00 0e 01 00 00 0f 01 05 05

Having total lengths makes it easy to sum without examining data.
Having total explicit length makes it easy to advance over a collation
element in a mapping to get to the next collation element.

This doesn't cover how we represent outliers that don't fit in one of
254 possible header bytes. I think the way we handle them is just
special-casing an otherwise-invalid case to mean "uncompressed" and
processing accordingly as described before:

> In such a system, a single value should be reserved for representing
> arbitrary lengths. Rather than any fancy encoding for them, I think
> I'd have the 3 levels be null-terminated. Where rr is the reserved
> header value, and pp ss and tt are bytes of the respective weight
> levels:
> 
> rr pp pp ... pp 00 ss ... ss 00 tt .. tt 00
> 
> This of course is not maximally efficient, but it's roughly only 1
> byte worse than storing 3 lengths with sufficient range, and would
> only be intended for exceptional cases.

An alternative that'd be a bit less costly to process is replacing the
null terminations with cumulative lengths l1 l2 l3, where l3 is the
total length, l3-l2 is the lenth of tertiary weight, l2-l1 the length
of secondary weight, and l1 is the length of primary. This avoids any
scanning of data when we just need the lengths.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.