Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250629151438.GV1827@brightrain.aerifal.cx>
Date: Sun, 29 Jun 2025 11:14:38 -0400
From: Rich Felker <dalias@...c.org>
To: Joakim Sindholt <opensource@...sha.com>
Cc: musl@...ts.openwall.com, Alexander Monakov <amonakov@...ras.ru>
Subject: Re: Paintcans for reverse iterating strings

On Sun, Jun 29, 2025 at 12:39:03PM +0200, Joakim Sindholt wrote:
> On Sun, 29 Jun 2025 10:58:11 +0300 (MSK), Alexander Monakov <amonakov@...ras.ru> wrote:
> > 
> > On Sat, 28 Jun 2025, Rich Felker wrote:
> > 
> > > One thing we're going to need for LC_COLLATE in locales where
> > > second-level weights are applied in reverse order (diacritic marks
> > > later in the string weigh more than earlier ones) is the ability to
> > > traverse (& live transform to NFD) the input string in reverse.
> > 
> > Apologies if I'm forgetting some essential context, but what is the
> > encoding of the input string, is it not always UTF-8? Reverse iteration
> > of valid UTF-8 is easy (unless you need something more specific than
> > just reverse iteration).
> 
> As I understand the problem it's not a matter of iterating UTF-8 in
> reverse but the normalized NFD codepoint stream it transforms into in
> reverse.

Yes, it's actually worse. It's the collation elements obtained from
forward traversal of the normalized NDF codepoint stream, which need
to be iterated in reverse. So there are like 4 levels of stacked
iteration:

    UTF-8 decoding -> decomposition -> canonical reordering ->
    collation elements

That last arrow is a multi-input (greedy), multi-output (individual
characters or sequences of characters can map to more than one
collation element). In theory it has its own "position within the
output", but since it's always the last step, it doesn't actually have
to be implemented as an iterator. The calling code could just iterate
(forwards or backwards as needed) over the array of elements obtained
from a given [sequence of] input[s]. There are only a finite number of
such arrays, of bounded size; they're not dynamically generated but
part of mmapped data.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.