![]() |
|
Message-ID: <20250629150646.GU1827@brightrain.aerifal.cx> Date: Sun, 29 Jun 2025 11:06:46 -0400 From: Rich Felker <dalias@...c.org> To: David Wang <00107082@....com> Cc: musl@...ts.openwall.com Subject: Re: Re:Paintcans for reverse iterating strings On Sun, Jun 29, 2025 at 04:30:12PM +0800, David Wang wrote: > > At 2025-06-29 08:08:43, "Rich Felker" <dalias@...c.org> wrote: > >One thing we're going to need for LC_COLLATE in locales where > >second-level weights are applied in reverse order (diacritic marks > >later in the string weigh more than earlier ones) is the ability to > >traverse (& live transform to NFD) the input string in reverse. > > > >In order not to take quadratic time doing this, we need a strategy for > >laying down a logarithmic (thus bounded by a small constant) number of > >waypoints ("paintcans") we can resume forward-processing at to > >simulate reverse-order random access. > > > Hi, > Maybe some mumbo jumbo, (I have no knowledge of LC_COLLATE, just takes it an interesting algorithm problem), > but is it possible to build a "small" finite state machine with "deterministic reverse transformations" > for all the "diacritic marks"? > If that possible, first iterating forward and keep state for each "diacritic marks", and > when reverse iterating string, a reverse state transformation can be carried out, maybe for each "diacritic marks" > , or could be only for current "diacritic marks". > (But this does not work for "random" access) The state in question is conceptually unbounded in size (within the range of size_t). It's potentially all characters up to the current point. In the common case of just a small number of nonstarters between successive starters, in theory you can process UTF-8 backwards from the end until you find a starter, then go forward. However, you need a method to avoid quadratic time when you don't have that common-case, and unless a hybrid approach of doing both is a lot faster in the common case, it's just gratuitous complexity/codesize/bug-surface. FWIW (I'll address this elsewhere) it's actually not even iterating normalized characters in reverse, but iterating collation elements in reverse. This adds a good deal more complexity to trying to do a direct reverse scanning. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.