Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250630151855.GZ1827@brightrain.aerifal.cx>
Date: Mon, 30 Jun 2025 11:18:55 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Paintcans for reverse iterating strings

On Sun, Jun 29, 2025 at 12:48:23PM -0400, Rich Felker wrote:
> On Sun, Jun 29, 2025 at 06:25:15PM +0200, Nick Wellnhofer wrote:
> > 
> > 
> > > On Jun 29, 2025, at 18:03, Rich Felker <dalias@...c.org> wrote:
> > > 
> > > On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
> > >> On Jun 29, 2025, at 02:08, Rich Felker <dalias@...c.org> wrote:
> > >>> One thing we're going to need for LC_COLLATE in locales where
> > >>> second-level weights are applied in reverse order (diacritic marks
> > >>> later in the string weigh more than earlier ones) is the ability to
> > >>> traverse (& live transform to NFD) the input string in reverse.
> > >> 
> > >> Assuming the context is strcoll and we're comparing two strings,
> > >> wouldn't it be possible to compare the strings in normal, forward
> > >> direction but instead of stopping at the first difference, comparing
> > >> all collation elements and returning the last difference (if any)?
> > > 
> > > I believe you can do something like this for strcoll. Note that,
> > > normally, you don't even get to second level weights when using
> > > strcoll.
> > > 
> > > Where you can't do it is strxfrm (transforming into a byte sequence
> > > that can be byte-by-byte compared).
> > 
> > For strxfrm, I assume you have to perform Step 3 of the UCA main
> > algorithm or something equivalent:
> > https://www.unicode.org/reports/tr10/#Step_3
> > 
> > So you just have to reverse the second-level weights afterwards
> > which seems trivial. Am I missing something?
> 
> Hm, indeed maybe you can do the reversal in-place in the destination
> buffer. I didn't think of that because I was assuming you needed to be
> able to do the same for strcoll where you don't have a destination
> buffer, but since you've found that it doesn't need to be done at all
> for strcoll (because you can just keep a running last-difference and
> use the final one) it seems like you always have a buffer you can do
> the reversal in.
> 
> Note that this does require the weights themselves to be iterable in
> reverse, which is not trivially the case in the CLDR format with
> fractional byte-based weights rather than 16-bit integers, but they
> can be assigned or transformed so that it's easy, or my algorithm can
> just be applied to the output weight buffer (which is a lot less
> costly than re-running iterators that decode, decompose, reorder, and
> lookup collation elements).

OK, this admits a trivial solution that works regardless of the
encoding of the weights, without any need for being synchronizable for
reverse traversal.

When emitting the second-level weights during the first pass (before
reversing them), reverse the bytes for each collating element on a
byte-by-byte basis. Then, the entire span of bytes in the output
corresponding to the second level weights can be reversed,
byte-by-byte without any consideration for where the collating element
boundaries are, to produce a final bytestring consisting of collating
elements in reverse order.

Moreover, since "second order is reversed" is a property of the locale
that's fixed at the time the locale file is generated, the reversal
first reversal could even have taken place in the on-disk format, so
that it doesn't have to be processed at runtime. I'm not sure yet if
this is a good idea when strcoll is probably the more-likely case that
needs to perform well, but it could be considered.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.