What are good test cases for benchmarking & stress testing substring search algorithms?

Submitted by 拟墨画扇 on 2019-12-12 09:39:07

Question


I'm trying to evaluate different substring search (à la strstr) algorithms and implementations, and I'm looking for some well-crafted needle and haystack strings that will expose worst-case performance and possible corner-case bugs. I suppose I could work them out myself, but I figure someone must have a good collection of test cases sitting around somewhere...


Answer 1:


Some thoughts and a partial answer to myself:

Worst case for the brute-force algorithm:

a^(n+1) b in (a^n b)^m

e.g. aaab in aabaabaabaabaabaabaab
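The pattern above can be sketched as a small generator plus a comparison-counting naive search. This is only an illustration: the names `naive_search` and `brute_force_worst_case` are made up for this example, and counting character comparisons is just one possible cost measure.

```python
def naive_search(haystack, needle):
    """Brute-force search.

    Returns (index of first match or -1, number of character comparisons).
    """
    comparisons = 0
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


def brute_force_worst_case(n, m):
    """Build needle a^(n+1) b and haystack (a^n b)^m (hypothetical helper)."""
    needle = "a" * (n + 1) + "b"
    haystack = ("a" * n + "b") * m
    return needle, haystack


if __name__ == "__main__":
    needle, haystack = brute_force_worst_case(2, 7)
    print(needle, "in", haystack)
    print(naive_search(haystack, needle))
```

With n = 2 and m = 7 this reproduces the example aaab in aabaabaabaabaabaabaab. The needle never matches, and each period of the haystack costs 1 + 2 + ... + (n+1) comparisons, so the total grows roughly as n/2 comparisons per haystack character.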

Worst case for SMOA:

Something like yxyxyxxyxyxyxx in (yxyxyxxyxyxyxy)^n. This needs further refinement. I'm trying to ensure that each advancement is only half the length of the partial match, and that the maximal suffix computation requires the maximal amount of backtracking. I'm pretty sure I'm on the right track, because this type of case is the only way I've found so far to make my implementation of SMOA (which is asymptotically 6n+5) run slower than glibc's Two-Way (which is asymptotically 2n-m but has moderately painful preprocessing overhead).

Worst case for anything rolling-hash based:

Whatever sequence of bytes causes hash collisions with the hash of the needle. For any reasonably-fast hash and a given needle, it should be easy to construct a haystack whose hash collides with the needle's hash at every point. However, it seems difficult to simultaneously create long partial matches, which are the only way to get the worst-case behavior. Naturally for worst-case behavior the needle must have some periodicity, and a way of emulating the hash by adjusting just the final characters.
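As a minimal sketch of that idea, assume a deliberately weak rolling hash: the plain sum of the byte values in the window. That hash is an assumption made for illustration only (its collisions are trivial to construct), and `rabin_karp_additive` is a hypothetical name, not any library's function. The needle is periodic with only its last two characters swapped, so every window collides and every other window also produces a long partial match.

```python
def rabin_karp_additive(haystack: bytes, needle: bytes):
    """Rabin-Karp style search with an additive (sum-of-bytes) rolling hash.

    Returns (match positions, spurious hash hits, character comparisons).
    """
    n, m = len(haystack), len(needle)
    if m == 0 or m > n:
        return [], 0, 0
    target = sum(needle)
    window = sum(haystack[:m])
    matches, collisions, comparisons = [], 0, 0
    for i in range(n - m + 1):
        if window == target:
            # Hash hit: verify character by character.
            j = 0
            while j < m and haystack[i + j] == needle[j]:
                j += 1
            comparisons += min(j + 1, m)
            if j == m:
                matches.append(i)
            else:
                collisions += 1
        if i + m < n:
            # Roll the window: add the incoming byte, drop the outgoing one.
            window += haystack[i + m] - haystack[i]
    return matches, collisions, comparisons


if __name__ == "__main__":
    k = 8
    needle = b"ab" * (k - 1) + b"ba"   # periodic, last two bytes swapped
    haystack = b"ab" * 1000            # every window has k a's and k b's
    print(rabin_karp_additive(haystack, needle))
```

Because the needle contains "bb" and the haystack does not, there is never a real match, yet the hash matches at every offset. A stronger hash would need a correspondingly more careful construction.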

Worst case for Two-Way:

This seems to be a very short needle with a nontrivial MS (maximal suffix) decomposition, something like bac, where the haystack contains repeated false positives for the right-half component of the needle, something like dacdacdacdacdacdacdac. The only way this algorithm can be slow (other than by glibc authors implementing it poorly...) is by making the outer loop iterate many times and repeatedly incur that overhead (and by making the setup overhead significant).

Other algorithms:

I'm really only interested in algorithms that are O(1) in space and have low preprocessing overhead, so I haven't looked at the worst cases of other algorithms much. At least Boyer-Moore (without the modifications that make it O(n)) has a nontrivial worst case where it becomes O(nm).




Answer 2:


This doesn't answer your question directly, but you may find the algorithms in the book Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology interesting (it has many novel algorithms for substring search). It is also a good source of special and complex cases.




Answer 3:


A procedure that might give interesting statistics, though I have no time to test right now:

Randomize over the string length, then randomize over string contents of that length, then randomize over the offset/length of a substring (possibly something not in the string), then randomly clobber part of the substring (possibly not at all), and repeat.
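One way to turn this procedure into a differential test is sketched below. It compares an implementation under test against Python's built-in `str.find` as the reference. The helper names (`random_case`, `check`, `naive_find`), the two-letter alphabet, the 50% clobber probability, and the extra letter "z" are all assumptions made for this example.

```python
import random


def random_case(rng, max_len=64, alphabet="ab"):
    """Build one random (haystack, needle) pair following the procedure."""
    n = rng.randint(0, max_len)                      # random length
    haystack = "".join(rng.choice(alphabet) for _ in range(n))
    start = rng.randint(0, n)                        # random offset
    length = rng.randint(0, n - start)               # random substring length
    needle = list(haystack[start:start + length])
    # Randomly clobber one character of the needle (possibly not at all).
    # "z" is outside the alphabet, so some needles cannot occur at all.
    if needle and rng.random() < 0.5:
        needle[rng.randrange(len(needle))] = rng.choice(alphabet + "z")
    return haystack, "".join(needle)


def check(search, trials=1000, seed=0):
    """Return the first failing case, or None if all trials agree."""
    rng = random.Random(seed)
    for _ in range(trials):
        haystack, needle = random_case(rng)
        expected = haystack.find(needle)
        got = search(haystack, needle)
        if got != expected:
            return haystack, needle, expected, got
    return None


def naive_find(haystack, needle):
    """Simple stand-in for the implementation under test."""
    m = len(needle)
    for i in range(len(haystack) - m + 1):
        if haystack[i:i + m] == needle:
            return i
    return -1


if __name__ == "__main__":
    print(check(naive_find))
```

A small alphabet is used here so that partial matches and repeated occurrences are common; a fixed seed keeps any failure reproducible.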




Answer 4:


You can generate container strings (and, likewise, the contained test values) recursively:

Starting with the empty string, generate every string obtained by extending a string already in the set with one character from an alphabet, added on the left or on the right (do both).

The alphabet for generating container strings is up to you.

Test two alphabets for the contained strings: one is the alphabet that makes up the container strings, the other is its complement.
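A minimal sketch of this exhaustive approach follows. Extending strings on the left or right from the empty string eventually yields every string over the alphabet, so the sketch enumerates them directly with `itertools.product` up to a length bound instead of recursing. The function names, the length bounds, and the use of the single letter "c" as a stand-in for the complement alphabet are assumptions for illustration.

```python
from itertools import product


def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


def exhaustive_check(search, container_alphabet="ab", outside_alphabet="c",
                     max_container=6, max_contained=3):
    """Compare `search` with str.find on every generated pair.

    Contained strings are drawn once from the container alphabet and once
    from an alphabet outside it. Returns the list of failing pairs.
    """
    failures = []
    for container in all_strings(container_alphabet, max_container):
        for alphabet in (container_alphabet, outside_alphabet):
            for contained in all_strings(alphabet, max_contained):
                if search(container, contained) != container.find(contained):
                    failures.append((container, contained))
    return failures


if __name__ == "__main__":
    # str.find checked against itself: expect an empty failure list.
    print(exhaustive_check(str.find))
```

The number of cases grows exponentially with the length bounds, so this is practical only for short strings, but it does cover the empty string and every small corner case.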



Source: https://stackoverflow.com/questions/3134602/what-are-good-test-cases-for-benchmarking-stress-testing-substring-search-algo
