musl - Re: Paintcans for reverse iterating strings

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250630151855.GZ1827@brightrain.aerifal.cx>
Date: Mon, 30 Jun 2025 11:18:55 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Paintcans for reverse iterating strings

On Sun, Jun 29, 2025 at 12:48:23PM -0400, Rich Felker wrote:
> On Sun, Jun 29, 2025 at 06:25:15PM +0200, Nick Wellnhofer wrote:
> > 
> > 
> > > On Jun 29, 2025, at 18:03, Rich Felker <dalias@...c.org> wrote:
> > > 
> > > On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
> > >> On Jun 29, 2025, at 02:08, Rich Felker <dalias@...c.org> wrote:
> > >>> One thing we're going to need for LC_COLLATE in locales where
> > >>> second-level weights are applied in reverse order (diacritic marks
> > >>> later in the string weigh more than earlier ones) is the ability to
> > >>> traverse (& live transform to NFD) the input string in reverse.
> > >> 
> > >> Assuming the context is strcoll and we're comparing two strings,
> > >> wouldn't it be possible to compare the strings in normal, forward
> > >> direction but instead of stopping at the first difference, comparing
> > >> all collation elements and returning the last difference (if any)?
> > > 
> > > I believe you can do something like this for strcoll. Note that,
> > > normally, you don't even get to second level weights when using
> > > strcoll.
> > > 
> > > Where you can't do it is strxfrm (transforming into a byte sequence
> > > that can be byte-by-byte compared).
> > 
> > For strxfrm, I assume you have to perform Step 3 of the UCA main
> > algorithm or something equivalent:
> > https://www.unicode.org/reports/tr10/#Step_3
> > 
> > So you just have to reverse the second-level weights afterwards
> > which seems trivial. Am I missing something?
> 
> Hm, indeed maybe you can do the reversal in-place in the destination
> buffer. I didn't think of that because I was assuming you needed to be
> able to do the same for strcoll where you don't have a destination
> buffer, but since you've found that it doesn't need to be done at all
> for strcoll (because you can just keep a running last-difference and
> use the final one) it seems like you always have a buffer you can do
> the reversal in.
> 
> Note that this does require the weights themselves to be iterable in
> reverse, which is not trivially the case in the CLDR format with
> fractional byte-based weights rather than 16-bit integers, but they
> can be assigned or transformed so that it's easy, or my algorithm can
> just be applied to the output weight buffer (which is a lot less
> costly than re-running iterators that decode, decompose, reorder, and
> lookup collation elements).

OK, this admits a trivial solution that works regardless of the
encoding of the weights, without any need for being synchronizable for
reverse traversal.

When emitting the second-level weights during the first pass (before
reversing them), reverse the bytes for each collating element on a
byte-by-byte basis. Then, the entire span of bytes in the output
corresponding to the second level weights can be reversed,
byte-by-byte without any consideration for where the collating element
boundaries are, to produce a final bytestring consisting of collating
elements in reverse order.

Moreover, since "second order is reversed" is a property of the locale
that's fixed at the time the locale file is generated, the reversal
first reversal could even have taken place in the on-disk format, so
that it doesn't have to be processed at runtime. I'm not sure yet if
this is a good idea when strcoll is probably the more-likely case that
needs to perform well, but it could be considered.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.