Why does String.indexOf() not use KMP?

岁酱吖の 提交于 2020-07-14 16:34:47

问题


I read the source code of java.lang.String and I was surprised to find that String.indexof() does not use the Knuth–Morris–Pratt algorithm? As we know, KMP is more effective. So why isn't it used in String.indexOf()?

Someone around me told me that for short string KMP is good enough, but if you need performance and you intend to use with large strings then is not a good choice. However he didn't tell me the details.

So, here are my questions:

  1. why don't we use KMP in String.indexOf()?
  2. why is KMP not a good choice with large Strings?

回答1:


KMP has better worst-case performance, but actually requires a little bit of up-front computation (to generate the table of offsets). It also requires an initial memory allocation, which could also impact performance.

For (presumably) common use-cases of searching in relatively short strings, this might actually end up slower than the primitive implementation.

This, bundled with the fact that for really huge data sets you will probably be using more specialized data structures than a simple String means that the increased implementation (and possibly runtime) cost is not worth investing.

Note that this might change in future Java versions, as the actual algorithm is not specified.




回答2:


KMP and several other asymptotically efficient string search methods like Boyer-Moore and Boyer-Moore-Horspool require extra memory -- in the case of KMP, O(m) memory, where m is the size of the substring being searched for. Although this is often acceptable, library designers have to make tradeoffs so that their code performs acceptably well in many different situations. Probably the main reason is that due to both the preprocessing required by KMP, and its more complex inner loop in the search phase, the constant factor slowdown may make it several times slower than the naive O(mn) substring search in many common cases (e.g. searching for a substring of < 10 characters in a long string). Also, someone searching for a large substring might be perplexed to find the runtime library running out of memory as it tries to allocate a large memory buffer for the KMP fallback function table.

Perhaps a better question is why O(m+n)-time, O(1)-space algorithms like the Two-Way Algorithm have not yet been adopted by mainstream language runtime libraries. Again, the answer is likely to be the constant-factor slowdown in common cases. Nevertheless in at least one C runtime library implementation, the corresponding strstr() function has been updated to use this algorithm.

Someone around me told me that for short string KMP is good enough, but if you need performance and you intend to use with large string then is not a good choice.

Well, that's exactly backwards from my understanding, which is that the naive O(mn) substring search is good enough (and probably the best) for short strings, but will eventually lose out to asymptotically faster O(m+n) algorithms like KMP as the strings become longer.



来源:https://stackoverflow.com/questions/19543547/why-does-string-indexof-not-use-kmp

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!