Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250629164822.GX1827@brightrain.aerifal.cx>
Date: Sun, 29 Jun 2025 12:48:23 -0400
From: Rich Felker <dalias@...c.org>
To: Nick Wellnhofer <wellnhofer@...um.de>
Cc: musl@...ts.openwall.com
Subject: Re: Paintcans for reverse iterating strings

On Sun, Jun 29, 2025 at 06:25:15PM +0200, Nick Wellnhofer wrote:
> 
> 
> > On Jun 29, 2025, at 18:03, Rich Felker <dalias@...c.org> wrote:
> > 
> > On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote:
> >> On Jun 29, 2025, at 02:08, Rich Felker <dalias@...c.org> wrote:
> >>> One thing we're going to need for LC_COLLATE in locales where
> >>> second-level weights are applied in reverse order (diacritic marks
> >>> later in the string weigh more than earlier ones) is the ability to
> >>> traverse (& live transform to NFD) the input string in reverse.
> >> 
> >> Assuming the context is strcoll and we're comparing two strings,
> >> wouldn't it be possible to compare the strings in normal, forward
> >> direction but instead of stopping at the first difference, comparing
> >> all collation elements and returning the last difference (if any)?
> > 
> > I believe you can do something like this for strcoll. Note that,
> > normally, you don't even get to second level weights when using
> > strcoll.
> > 
> > Where you can't do it is strxfrm (transforming into a byte sequence
> > that can be byte-by-byte compared).
> 
> For strxfrm, I assume you have to perform Step 3 of the UCA main
> algorithm or something equivalent:
> https://www.unicode.org/reports/tr10/#Step_3
> 
> So you just have to reverse the second-level weights afterwards
> which seems trivial. Am I missing something?

Hm, indeed maybe you can do the reversal in-place in the destination
buffer. I didn't think of that because I was assuming you needed to be
able to do the same for strcoll where you don't have a destination
buffer, but since you've found that it doesn't need to be done at all
for strcoll (because you can just keep a running last-difference and
use the final one) it seems like you always have a buffer you can do
the reversal in.

Note that this does require the weights themselves to be iterable in
reverse, which is not trivially the case in the CLDR format with
fractional byte-based weights rather than 16-bit integers, but they
can be assigned or transformed so that it's easy, or my algorithm can
just be applied to the output weight buffer (which is a lot less
costly than re-running iterators that decode, decompose, reorder, and
lookup collation elements).

Thanks!

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.