Aho-Corasick text matching on whole words?

Submitted by 笑着哭i on 2019-12-24 00:58:20

Question


I'm using Aho-Corasick text matching and wonder whether it can be altered to match terms instead of characters. In other words, I want terms to be the unit of matching rather than characters. As an example:

Search query: "He",

Sentence: "Hello world",

Aho-Corasick will match "he" to the sentence "hello world" ending at index 2, but I would prefer to have no match. By "terms" I mean whole words rather than characters.


Answer 1:


One way to do this would be to use Aho-Corasick as usual, then do a filtering step where you eliminate all false positives. For example, every time you find a match, you can confirm that the next and previous characters in the input are non-letter characters like spaces or punctuation. That way, you get the speed of the Aho-Corasick lookup, but only consider matches that appear as whole words in the text.
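The filtering step might look like the sketch below. It assumes matches arrive as (start, end, pattern) spans with an exclusive end, and it treats letters, digits, and underscore as word characters. Both choices are assumptions, not something Aho-Corasick prescribes. The find_all helper is only a hypothetical stand-in for the automaton scan so that the example runs on its own.

```python
def is_word_char(ch):
    # Assumption: letters, digits and underscore count as part of a word.
    return ch.isalnum() or ch == "_"

def is_whole_word(text, start, end):
    """True if text[start:end] is not flanked by a word character."""
    before_ok = start == 0 or not is_word_char(text[start - 1])
    after_ok = end == len(text) or not is_word_char(text[end])
    return before_ok and after_ok

def find_all(text, patterns):
    """Hypothetical stand-in for an Aho-Corasick scan: yields (start, end, pattern)."""
    for p in patterns:
        i = text.find(p)
        while i != -1:
            yield i, i + len(p), p
            i = text.find(p, i + 1)

def whole_word_matches(text, patterns):
    # Keep only the spans that sit on word boundaries on both sides.
    return [m for m in find_all(text, patterns) if is_whole_word(text, m[0], m[1])]

print(whole_word_matches("hello world", ["he"]))
print(sorted(whole_word_matches("hello world, he said", ["he", "world"])))
```

The check costs constant time per reported match, so the overall running time stays linear in the text length plus the number of raw matches.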

Hope this helps!




Answer 2:


One possibility would be to include the space character in your search term, possibly after pre-processing your input to convert all sorts of white space (space, line feed, carriage return, tab...) to the same space character.
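A rough sketch of that pre-processing, under some assumptions of my own: text is lowercased, every run of non-word characters (whitespace and punctuation alike) collapses to a single space, and both the text and each search term are padded with a space on each side. The plain `in` test is only a stand-in for the Aho-Corasick scan, and normalize and padded_hits are hypothetical names.

```python
import re

def normalize(s):
    """Lowercase, collapse runs of non-word characters to one space, pad both ends."""
    return " " + re.sub(r"\W+", " ", s.lower()).strip() + " "

def padded_hits(text, patterns):
    haystack = normalize(text)
    # Stand-in for the automaton: a padded term can only match at word boundaries.
    return [p for p in patterns if normalize(p) in haystack]

print(padded_hits("Hello world", ["He"]))
print(padded_hits("He said:\thello\nworld", ["He", "hello world", "wor"]))
```

Padding the text as well as the terms lets a term match at the very start or end of the input.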

Another possibility would be to think of the characters of your alphabet, as far as Aho-Corasick is concerned, as being words. Aho-Corasick will work just as quickly (if not more quickly) with an alphabet of size 2^32 where each word seen in the input text is encoded as a single character, as it will with an alphabet of size 2^8 where a character is just a single byte, as usual.
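To make the word-as-symbol idea concrete, here is a minimal Aho-Corasick sketch whose transitions are keyed by whole tokens. It uses the lowercased word strings themselves as symbols, which plays the same role as mapping each word to an integer id. The tokenizer (a `\w+` regex that drops punctuation) and all function names are assumptions made for illustration.

```python
import re
from collections import deque

def tokenize(s):
    # Assumption: words are runs of \w characters; punctuation is discarded.
    return tuple(re.findall(r"\w+", s.lower()))

def build_automaton(patterns):
    """patterns: non-empty tuples of tokens. Returns (goto, fail, out) tables."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for sym in pat:
            if sym not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][sym] = len(goto) - 1
            state = goto[state][sym]
        out[state].append(pat)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:                             # breadth-first failure links
        state = queue.popleft()
        for sym, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and sym not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(sym, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(tokens, automaton):
    """Return (index of first token, pattern) for every match in the token stream."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, tok in enumerate(tokens):
        while state and tok not in goto[state]:
            state = fail[state]
        state = goto[state].get(tok, 0)
        hits.extend((i - len(pat) + 1, pat) for pat in out[state])
    return hits

auto = build_automaton([tokenize(p) for p in ["he", "hello world", "world"]])
print(search(tokenize("Hello world, he said"), auto))
```

Because a token either equals a pattern word or does not, "he" can never match inside "hello". Reported positions are token indices, so mapping them back to character offsets would need the tokenizer to record each token's span.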

In either case you have to make a decision about what your pre-processing does with punctuation.




Answer 3:


Very late to the party, but another option is to insert some symbols into the trie that represent the start and end of words. Then, during the matching stage, they must match accordingly. I'm about to try that approach myself.
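One way this could be realized is to feed the automaton a symbol stream in which every word is bracketed by sentinel symbols, and to bracket the patterns the same way before inserting them into the trie. The sketch below shows only that transformation. The sentinel values, the word-character rule, and the function names are assumptions, and the final substring test merely stands in for the trie walk.

```python
WORD_START, WORD_END = "\x02", "\x03"  # hypothetical sentinels, assumed absent from the input

def is_word_char(ch):
    # Assumption: letters, digits and underscore count as part of a word.
    return ch.isalnum() or ch == "_"

def with_boundaries(text):
    """Yield the characters of text, adding a sentinel at each word start and end."""
    prev_in_word = False
    for ch in text:
        in_word = is_word_char(ch)
        if in_word and not prev_in_word:
            yield WORD_START
        elif prev_in_word and not in_word:
            yield WORD_END
        yield ch
        prev_in_word = in_word
    if prev_in_word:
        yield WORD_END

def mark(s):
    # Patterns get the same treatment as the text before going into the trie.
    return "".join(with_boundaries(s))

print(mark("he") in mark("hello world"))
print(mark("he") in mark("hello world, he said"))
```

Unlike space padding, this keeps the original punctuation in the stream, so a phrase pattern only matches when its separators match too.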



Source: https://stackoverflow.com/questions/14444738/aho-corasick-text-matching-on-whole-words
