Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250629150646.GU1827@brightrain.aerifal.cx>
Date: Sun, 29 Jun 2025 11:06:46 -0400
From: Rich Felker <dalias@...c.org>
To: David Wang <00107082@....com>
Cc: musl@...ts.openwall.com
Subject: Re: Re:Paintcans for reverse iterating strings

On Sun, Jun 29, 2025 at 04:30:12PM +0800, David Wang wrote:
> 
> At 2025-06-29 08:08:43, "Rich Felker" <dalias@...c.org> wrote:
> >One thing we're going to need for LC_COLLATE in locales where
> >second-level weights are applied in reverse order (diacritic marks
> >later in the string weigh more than earlier ones) is the ability to
> >traverse (& live transform to NFD) the input string in reverse.
> >
> >In order not to take quadratic time doing this, we need a strategy for
> >laying down a logarithmic (thus bounded by a small constant) number of
> >waypoints ("paintcans") we can resume forward-processing at to
> >simulate reverse-order random access.
> >
> Hi,
> Maybe some mumbo jumbo, (I have no knowledge of LC_COLLATE,  just takes it  an interesting  algorithm problem),
> but is it possible to build a "small"  finite state machine with "deterministic reverse transformations"
> for all the "diacritic marks"?
> If that possible,  first iterating forward and keep state for each "diacritic marks", and 
> when reverse iterating string, a reverse state transformation can be carried out, maybe for each "diacritic marks"
> , or could be only for current "diacritic marks".
> (But this does not work for "random" access)

The state in question is conceptually unbounded in size (within the
range of size_t). It's potentially all characters up to the current
point.

In the common case of just a small number of nonstarters between
successive starters, in theory you can process UTF-8 backwards from
the end until you find a starter, then go forward. However, you need
a method to avoid quadratic time when you don't have that common-case,
and unless a hybrid approach of doing both is a lot faster in the
common case, it's just gratuitous complexity/codesize/bug-surface.

FWIW (I'll address this elsewhere) it's actually not even iterating
normalized characters in reverse, but iterating collation elements in
reverse. This adds a good deal more complexity to trying to do a
direct reverse scanning.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.