musl - Further strstr findings

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20181117220938.GN5150@brightrain.aerifal.cx>
Date: Sat, 17 Nov 2018 17:09:38 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Further strstr findings

I've been doing some additional reading on the topic, mainly from the
following sources which I'm citing here for reference:

[1] Galil and Seiferas. Time-Space-Optimal String Matching.
https://urresearch.rochester.edu/fileDownloadForInstitutionalItem.action?itemId=10186&itemFileId=22371

[2] Breslauer. Saving Comparisons in the Crochemore-Perrin String Matching Algorithm.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.6641&rep=rep1&type=pdf

[3] Breslauer, Grossi, and Mignosi. Simple real-time constant-space string matching.
https://www.sciencedirect.com/science/article/pii/S0304397512010900/pdf?md5=1bc807e14b362af266ff04b78fa8c4df&pid=1-s2.0-S0304397512010900-main.pdf&_valck=1

In particular, [1] offers a big-O-equivalent of Two-Way a decade
earlier, with the same basic search-phase concept but a different (and
probably more expensive, though I'm not sure) factorization approach.
[2] is interesting in that it provides an approach to optimize the
skip forward to prevent repeated comparisons.

However, I think the academic work in this area is mostly misplaced.
The interesting real-world problem does not seem to be further
optimizing pathological needles with nasty periodicity properties. We
already assure non-pathologically-bad performance for them by virtue
of Two-Way being O(n) with a bound of roughly 2n comparisons. Rather,
the interesting problem is avoiding penalizing common real-world
non-periodic (or minimally periodic) needles for the sake of ensuring
O(n) performance even in the worst case.

One of the things that always struck me as ugly about Two-Way is that
the factorization that pops out depends on a choice of ordering on the
alphabet. In the non-periodic needle case, the lower-bound on the
period in turn depends on the factorization, which means the
performance depens on the factorization, which in turn depended on an
arbitrary choice of ordering. As an example, assuming an alphabet
[a-z] in the standard order, taking a non-periodic needle of length m
in which "a" and "z" do not appear and inserting "az" in the middle of
it ensures that the shorter maximal suffix will have length m/2+1 and
that the estimated period will be m/2+2, vs the real period of m+2.
This inspires a motivation to make the choice of order on the alphabet
non-arbitrary, and tailor it to achieving an optimal factorization.

So far the most promising approach I've found is to order the alphabet
according to first appearance in the needle. For typical real-world
needles, this results in the whole needle being its own maximal suffix
in one direction (in which case, the exact period of the whole needle
falls out as a side effect, allowing optimal skip-forward if this
period is kept rather than using the estimate), and a very short
maximal suffix in the other direction, which usually becomes the right
factor -- the length of this maximal suffix is bounded by the distance
of the last first appearance of a new character from the end of the
needle. In particular, if the last character is one that has not
appeared before in the needle, the needle factors into components of
length m-1 and 1, and strchr can be used to search for the right
factor.

This choice of order is still only heuristic; in particular, it does
not help when the tail of the needle is all repetitions of characters
that appeared early in the needle, which includes all the canonical
pathological cases that can be realized with small (e.g. just
2-character) alphabets. However it still does have the nice property
of being invariant under permutations of the alphabet.

Note that the special case where the right factor will have length 1
can be detected early, allowing the whole MS-decomposition to be
skipped. When building the order for the alphabet, if the last slot of
the needle is character that was not seen before, period=m and
suffix_pos=m-1 can be inferred automatically. This does nothing for
search phase but it does save on setup time.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.