![]() |
|
Message-ID: <20250629164822.GX1827@brightrain.aerifal.cx> Date: Sun, 29 Jun 2025 12:48:23 -0400 From: Rich Felker <dalias@...c.org> To: Nick Wellnhofer <wellnhofer@...um.de> Cc: musl@...ts.openwall.com Subject: Re: Paintcans for reverse iterating strings On Sun, Jun 29, 2025 at 06:25:15PM +0200, Nick Wellnhofer wrote: > > > > On Jun 29, 2025, at 18:03, Rich Felker <dalias@...c.org> wrote: > > > > On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote: > >> On Jun 29, 2025, at 02:08, Rich Felker <dalias@...c.org> wrote: > >>> One thing we're going to need for LC_COLLATE in locales where > >>> second-level weights are applied in reverse order (diacritic marks > >>> later in the string weigh more than earlier ones) is the ability to > >>> traverse (& live transform to NFD) the input string in reverse. > >> > >> Assuming the context is strcoll and we're comparing two strings, > >> wouldn't it be possible to compare the strings in normal, forward > >> direction but instead of stopping at the first difference, comparing > >> all collation elements and returning the last difference (if any)? > > > > I believe you can do something like this for strcoll. Note that, > > normally, you don't even get to second level weights when using > > strcoll. > > > > Where you can't do it is strxfrm (transforming into a byte sequence > > that can be byte-by-byte compared). > > For strxfrm, I assume you have to perform Step 3 of the UCA main > algorithm or something equivalent: > https://www.unicode.org/reports/tr10/#Step_3 > > So you just have to reverse the second-level weights afterwards > which seems trivial. Am I missing something? Hm, indeed maybe you can do the reversal in-place in the destination buffer. I didn't think of that because I was assuming you needed to be able to do the same for strcoll where you don't have a destination buffer, but since you've found that it doesn't need to be done at all for strcoll (because you can just keep a running last-difference and use the final one) it seems like you always have a buffer you can do the reversal in. Note that this does require the weights themselves to be iterable in reverse, which is not trivially the case in the CLDR format with fractional byte-based weights rather than 16-bit integers, but they can be assigned or transformed so that it's easy, or my algorithm can just be applied to the output weight buffer (which is a lot less costly than re-running iterators that decode, decompose, reorder, and lookup collation elements). Thanks! Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.