![]() |
|
Message-ID: <20250630151855.GZ1827@brightrain.aerifal.cx> Date: Mon, 30 Jun 2025 11:18:55 -0400 From: Rich Felker <dalias@...c.org> To: Nick Wellnhofer <wellnhofer@...um.de> Cc: musl@...ts.openwall.com Subject: Re: Paintcans for reverse iterating strings On Sun, Jun 29, 2025 at 12:48:23PM -0400, Rich Felker wrote: > On Sun, Jun 29, 2025 at 06:25:15PM +0200, Nick Wellnhofer wrote: > > > > > > > On Jun 29, 2025, at 18:03, Rich Felker <dalias@...c.org> wrote: > > > > > > On Sun, Jun 29, 2025 at 05:39:14PM +0200, Nick Wellnhofer wrote: > > >> On Jun 29, 2025, at 02:08, Rich Felker <dalias@...c.org> wrote: > > >>> One thing we're going to need for LC_COLLATE in locales where > > >>> second-level weights are applied in reverse order (diacritic marks > > >>> later in the string weigh more than earlier ones) is the ability to > > >>> traverse (& live transform to NFD) the input string in reverse. > > >> > > >> Assuming the context is strcoll and we're comparing two strings, > > >> wouldn't it be possible to compare the strings in normal, forward > > >> direction but instead of stopping at the first difference, comparing > > >> all collation elements and returning the last difference (if any)? > > > > > > I believe you can do something like this for strcoll. Note that, > > > normally, you don't even get to second level weights when using > > > strcoll. > > > > > > Where you can't do it is strxfrm (transforming into a byte sequence > > > that can be byte-by-byte compared). > > > > For strxfrm, I assume you have to perform Step 3 of the UCA main > > algorithm or something equivalent: > > https://www.unicode.org/reports/tr10/#Step_3 > > > > So you just have to reverse the second-level weights afterwards > > which seems trivial. Am I missing something? > > Hm, indeed maybe you can do the reversal in-place in the destination > buffer. I didn't think of that because I was assuming you needed to be > able to do the same for strcoll where you don't have a destination > buffer, but since you've found that it doesn't need to be done at all > for strcoll (because you can just keep a running last-difference and > use the final one) it seems like you always have a buffer you can do > the reversal in. > > Note that this does require the weights themselves to be iterable in > reverse, which is not trivially the case in the CLDR format with > fractional byte-based weights rather than 16-bit integers, but they > can be assigned or transformed so that it's easy, or my algorithm can > just be applied to the output weight buffer (which is a lot less > costly than re-running iterators that decode, decompose, reorder, and > lookup collation elements). OK, this admits a trivial solution that works regardless of the encoding of the weights, without any need for being synchronizable for reverse traversal. When emitting the second-level weights during the first pass (before reversing them), reverse the bytes for each collating element on a byte-by-byte basis. Then, the entire span of bytes in the output corresponding to the second level weights can be reversed, byte-by-byte without any consideration for where the collating element boundaries are, to produce a final bytestring consisting of collating elements in reverse order. Moreover, since "second order is reversed" is a property of the locale that's fixed at the time the locale file is generated, the reversal first reversal could even have taken place in the on-disk format, so that it doesn't have to be processed at runtime. I'm not sure yet if this is a good idea when strcoll is probably the more-likely case that needs to perform well, but it could be considered. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.