Large regex patterns: PCRE won't do it

Posted by 馋奶兔 on 2019-12-25 06:44:49

Question


I have a long list of words that I want to search for in a large string. There are about 500 words and the string is usually around 500K in size.

PCRE fails with this error: "preg_match_all: Compilation failed: regular expression is too large at offset 704416".

Is there an alternative to this? I know I can recompile PCRE with a higher internal linkage size, but I want to avoid messing around with server packages.


Answer 1:


Could you approach the problem from the other direction?

  1. Use a regex to clean up your 500K of HTML and pull all the words out into one big array. Something like \b(\w+)\b (sorry, I haven't tested that).

  2. Build a hash table of the 500 words you want to check. Assuming case doesn't matter, you would lowercase (or uppercase) all the words. The hash table could store integers (or some more complex object) to keep track of matches.

  3. Loop through each word from (1), lowercase it, and look it up in your hash table.

  4. Increment the item in your hash table when it matches.
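
The four steps above could be sketched in PHP roughly as follows. This is only an illustration: the function name countKeywordHits and the sample data are made up, and it assumes the markup can simply be dropped with strip_tags, that case doesn't matter, and that the words are plain ASCII (without the /u modifier, \w does not match multibyte letters).

```php
<?php
// Hypothetical helper: counts how often each keyword occurs in $html.
function countKeywordHits(string $html, array $keywords): array
{
    // Step 2: hash table keyed by lowercased keyword, every counter starting at 0.
    $counts = array_fill_keys(array_map('strtolower', $keywords), 0);

    // Step 1: drop the markup, then pull out every run of word characters.
    preg_match_all('/\w+/', strip_tags($html), $m);

    // Steps 3 and 4: lowercase each word and bump its counter on a hit.
    foreach ($m[0] as $word) {
        $key = strtolower($word);
        if (isset($counts[$key])) {
            $counts[$key]++;
        }
    }
    return $counts;
}

// Sample input, invented for this sketch.
$hits = countKeywordHits('<p>Foo bar foo, baz.</p>', ['foo', 'baz', 'qux']);
```

The pattern here stays tiny no matter how many keywords there are, so the "regular expression is too large" limit never comes into play; the 500 words live in the array keys, where isset() finds them in roughly constant time.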




Answer 2:


Perhaps you might consider tokenizing your input string instead, and then simply iterating through each token and seeing if it's one of the words you're looking for?
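
One way to read this suggestion is to walk the string token by token with strtok, without first building an array of every word. A rough sketch, assuming whitespace and common punctuation are the delimiters; the function name findKeywords and the delimiter set are illustrative choices, not anything from the question.

```php
<?php
// Hypothetical helper: returns the distinct keywords that occur in $text.
function findKeywords(string $text, array $keywords): array
{
    // Flip the list so each lookup is an isset() on the array keys.
    $wanted = array_flip(array_map('strtolower', $keywords));
    $found = [];

    // Assumed delimiter set: whitespace plus common punctuation.
    $delims = " \t\r\n.,;:!?\"'()[]<>/";

    // strtok() hands back one token per call and false when none are left.
    for ($tok = strtok($text, $delims); $tok !== false; $tok = strtok($delims)) {
        $key = strtolower($tok);
        if (isset($wanted[$key])) {
            $found[$key] = true;
        }
    }
    return array_keys($found);
}

// Sample input, invented for this sketch.
$found = findKeywords("The quick brown fox; the lazy dog.", ['fox', 'cat', 'the']);
```

The comparison against false is deliberately strict, so a token such as "0" does not end the loop early.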




Answer 3:


You can try re2.

One of its strengths is that it uses automata theory to guarantee that a regex runs in time linear in the size of its input.




Answer 4:


You can use str_word_count, or explode the string on whitespace (or whatever delimiter makes sense for the context of your document), then filter the results against your keywords.

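// Format 1 makes str_word_count return the words as an array.
$allWordsArray = str_word_count($content, 1);
// Keep only the words that are in the keyword list (strict, case-sensitive comparison).
$matchedWords = array_filter($allWordsArray, function ($word) use ($keywordsArray) {
    return in_array($word, $keywordsArray, true);
});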

This assumes PHP 5.3+ for the closure; in earlier versions of PHP you can substitute create_function.



Source: https://stackoverflow.com/questions/4989985/large-regex-patterns-pcrc-wont-do-it
