![]() |
|
Message-ID: <20250629151438.GV1827@brightrain.aerifal.cx> Date: Sun, 29 Jun 2025 11:14:38 -0400 From: Rich Felker <dalias@...c.org> To: Joakim Sindholt <opensource@...sha.com> Cc: musl@...ts.openwall.com, Alexander Monakov <amonakov@...ras.ru> Subject: Re: Paintcans for reverse iterating strings On Sun, Jun 29, 2025 at 12:39:03PM +0200, Joakim Sindholt wrote: > On Sun, 29 Jun 2025 10:58:11 +0300 (MSK), Alexander Monakov <amonakov@...ras.ru> wrote: > > > > On Sat, 28 Jun 2025, Rich Felker wrote: > > > > > One thing we're going to need for LC_COLLATE in locales where > > > second-level weights are applied in reverse order (diacritic marks > > > later in the string weigh more than earlier ones) is the ability to > > > traverse (& live transform to NFD) the input string in reverse. > > > > Apologies if I'm forgetting some essential context, but what is the > > encoding of the input string, is it not always UTF-8? Reverse iteration > > of valid UTF-8 is easy (unless you need something more specific than > > just reverse iteration). > > As I understand the problem it's not a matter of iterating UTF-8 in > reverse but the normalized NFD codepoint stream it transforms into in > reverse. Yes, it's actually worse. It's the collation elements obtained from forward traversal of the normalized NDF codepoint stream, which need to be iterated in reverse. So there are like 4 levels of stacked iteration: UTF-8 decoding -> decomposition -> canonical reordering -> collation elements That last arrow is a multi-input (greedy), multi-output (individual characters or sequences of characters can map to more than one collation element). In theory it has its own "position within the output", but since it's always the last step, it doesn't actually have to be implemented as an iterator. The calling code could just iterate (forwards or backwards as needed) over the array of elements obtained from a given [sequence of] input[s]. There are only a finite number of such arrays, of bounded size; they're not dynamically generated but part of mmapped data. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.