![]() |
|
Message-ID: <20250629035154.GT1827@brightrain.aerifal.cx> Date: Sat, 28 Jun 2025 23:51:56 -0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Re: Paintcans for reverse iterating strings On Sat, Jun 28, 2025 at 08:08:43PM -0400, Rich Felker wrote: > One thing we're going to need for LC_COLLATE in locales where > second-level weights are applied in reverse order (diacritic marks > later in the string weigh more than earlier ones) is the ability to > traverse (& live transform to NFD) the input string in reverse. > > In order not to take quadratic time doing this, we need a strategy for > laying down a logarithmic (thus bounded by a small constant) number of > waypoints ("paintcans") we can resume forward-processing at to > simulate reverse-order random access. > > I'm posting here a proposal for comments on whether this seems fairly > optimal or has any obvious improvements: > > > Maintain an array of up to N waypoints, initially populated just by > the halfway point. > > When requesting a position, first remove any waypoints past the > requested position from the array, then start from the last remaining > waypoint and work forward. When crossing the halfway point between the > last waypoint and the desired position, if the waypoint array is not > already full, add the halfway position as a waypoint. > > > With unlimited N, I'm pretty sure this gives O(N log N) total > traversal time. In a finite real-machine model, N=8*sizeof(size_t) > then gives the same. And if we know ahead of time that the size of the > input is way less than the maximum possible length (note: total length > is known from a previous pass), we can pre-populate linear-spaced > early waypoints or space them differently (like 1/4 instead of 1/2 the > way to the target) so that less time is spent backtracking (I haven't > yet thought much about which is best). > > Note that N=64 is no problem in terms of storage space; it's under 4k. > And the majority would not even get used/touched except on > pathologically long inputs to strcoll/strxfrm, and only in the case of > a locale with reversed second-level weights. At least empirically this seems to work. I'm getting expected output and for a 21-character string it performs the forward-iterate operation 63 times, which is not bad. This is without any logic to emit extra waypoints when log(len) is small and only a few slots are needed for the halfway steps. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.