Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250629035154.GT1827@brightrain.aerifal.cx>
Date: Sat, 28 Jun 2025 23:51:56 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Paintcans for reverse iterating strings

On Sat, Jun 28, 2025 at 08:08:43PM -0400, Rich Felker wrote:
> One thing we're going to need for LC_COLLATE in locales where
> second-level weights are applied in reverse order (diacritic marks
> later in the string weigh more than earlier ones) is the ability to
> traverse (& live transform to NFD) the input string in reverse.
> 
> In order not to take quadratic time doing this, we need a strategy for
> laying down a logarithmic (thus bounded by a small constant) number of
> waypoints ("paintcans") we can resume forward-processing at to
> simulate reverse-order random access.
> 
> I'm posting here a proposal for comments on whether this seems fairly
> optimal or has any obvious improvements:
> 
> 
> Maintain an array of up to N waypoints, initially populated just by
> the halfway point.
> 
> When requesting a position, first remove any waypoints past the
> requested position from the array, then start from the last remaining
> waypoint and work forward. When crossing the halfway point between the
> last waypoint and the desired position, if the waypoint array is not
> already full, add the halfway position as a waypoint.
> 
> 
> With unlimited N, I'm pretty sure this gives O(N log N) total
> traversal time. In a finite real-machine model, N=8*sizeof(size_t)
> then gives the same. And if we know ahead of time that the size of the
> input is way less than the maximum possible length (note: total length
> is known from a previous pass), we can pre-populate linear-spaced
> early waypoints or space them differently (like 1/4 instead of 1/2 the
> way to the target) so that less time is spent backtracking (I haven't
> yet thought much about which is best).
> 
> Note that N=64 is no problem in terms of storage space; it's under 4k.
> And the majority would not even get used/touched except on
> pathologically long inputs to strcoll/strxfrm, and only in the case of
> a locale with reversed second-level weights.

At least empirically this seems to work. I'm getting expected output
and for a 21-character string it performs the forward-iterate
operation 63 times, which is not bad. This is without any logic to
emit extra waypoints when log(len) is small and only a few slots are
needed for the halfway steps.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.