
Date: Sat, 17 Nov 2018 17:09:38 0500 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Further strstr findings I've been doing some additional reading on the topic, mainly from the following sources which I'm citing here for reference: [1] Galil and Seiferas. TimeSpaceOptimal String Matching. https://urresearch.rochester.edu/fileDownloadForInstitutionalItem.action?itemId=10186&itemFileId=22371 [2] Breslauer. Saving Comparisons in the CrochemorePerrin String Matching Algorithm. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.6641&rep=rep1&type=pdf [3] Breslauer, Grossi, and Mignosi. Simple realtime constantspace string matching. https://www.sciencedirect.com/science/article/pii/S0304397512010900/pdf?md5=1bc807e14b362af266ff04b78fa8c4df&pid=1s2.0S0304397512010900main.pdf&_valck=1 In particular, [1] offers a bigOequivalent of TwoWay a decade earlier, with the same basic searchphase concept but a different (and probably more expensive, though I'm not sure) factorization approach. [2] is interesting in that it provides an approach to optimize the skip forward to prevent repeated comparisons. However, I think the academic work in this area is mostly misplaced. The interesting realworld problem does not seem to be further optimizing pathological needles with nasty periodicity properties. We already assure nonpathologicallybad performance for them by virtue of TwoWay being O(n) with a bound of roughly 2n comparisons. Rather, the interesting problem is avoiding penalizing common realworld nonperiodic (or minimally periodic) needles for the sake of ensuring O(n) performance even in the worst case. One of the things that always struck me as ugly about TwoWay is that the factorization that pops out depends on a choice of ordering on the alphabet. In the nonperiodic needle case, the lowerbound on the period in turn depends on the factorization, which means the performance depens on the factorization, which in turn depended on an arbitrary choice of ordering. As an example, assuming an alphabet [az] in the standard order, taking a nonperiodic needle of length m in which "a" and "z" do not appear and inserting "az" in the middle of it ensures that the shorter maximal suffix will have length m/2+1 and that the estimated period will be m/2+2, vs the real period of m+2. This inspires a motivation to make the choice of order on the alphabet nonarbitrary, and tailor it to achieving an optimal factorization. So far the most promising approach I've found is to order the alphabet according to first appearance in the needle. For typical realworld needles, this results in the whole needle being its own maximal suffix in one direction (in which case, the exact period of the whole needle falls out as a side effect, allowing optimal skipforward if this period is kept rather than using the estimate), and a very short maximal suffix in the other direction, which usually becomes the right factor  the length of this maximal suffix is bounded by the distance of the last first appearance of a new character from the end of the needle. In particular, if the last character is one that has not appeared before in the needle, the needle factors into components of length m1 and 1, and strchr can be used to search for the right factor. This choice of order is still only heuristic; in particular, it does not help when the tail of the needle is all repetitions of characters that appeared early in the needle, which includes all the canonical pathological cases that can be realized with small (e.g. just 2character) alphabets. However it still does have the nice property of being invariant under permutations of the alphabet. Note that the special case where the right factor will have length 1 can be detected early, allowing the whole MSdecomposition to be skipped. When building the order for the alphabet, if the last slot of the needle is character that was not seen before, period=m and suffix_pos=m1 can be inferred automatically. This does nothing for search phase but it does save on setup time. Rich
Powered by blists  more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.