Aho Corasick algorithm

こ雲淡風輕ζ 提交于 2019-12-18 12:37:44

问题


I am not able to understand the below algorithm which is used for string pattern matching using Aho-Corasick alg.

Procedure AC(y,n,q0)
INPUT: y<-array of m bytes representing the text input
(SQL Query Statement)
n<-integer representing the text length
(SQL Query Length)
q0<-initial state (first character in pattern)
2: State <-q0
3: For i = 1 to n do
4: While g ( State, y[i] = = fail) do
5: State ← f (State)
6: End While
7: State ← g(State,.y[i])
8: If o(State) 􀂏 then
9: Output i
10: Else
11: Output
12: End If
13: End for
14: End Procedure

回答1:


You probably won't gain a good understanding of the Aho-Corasick algorithm from reading a little bit of pseudocode. Unless you understand the state transition table, the algorithm will make no sense at all.

There's a decent explanation along with an animation at Aho-Corasick implementation and animation.

The original paper, Efficient String Matching: An Aid to Bibliographic Search(PDF), is well written and understandable, and the pseudocode examples are pretty easy to convert to working code. It'll take a little study, but you should have a good understanding after you read the paper, think about it a bit, and then read it again.




回答2:


Aho Corasick algorithm is used to solve set matching problem. It means we have a set of strings S, and here comes a long string L to check whether L contains any one in the previous set S.

An basic solution is using a trie tree, i.e. a prefix tree, please see Wikipedia. There are typically two steps to deal with the problem.

  1. Precompute the trie tree based on the string set S;
  2. Scan the string L to check.

The trie tree is easy to understand. It store prefix substrings with nodes starts from the root.

The Aho-Corasick algorithm is an extension of a trie tree, not far from the basic idea. Aho-Corasick algorithm adds a failed pointer to each node on a trie tree.

When fails, a trie tree will restart from the root (add the start index on L by 1), but Aho-Corasick algorithm will restart from the node D pointed by the failed pointer (add the start index on L by the depth of the node D).

Below is a C++ implementation of the Aho-Corasick algorithm. It contains some bugs. http://www.komodia.com/aho-corasick

I fixed the bugs I found. And you can access my version here: https://github.com/elfinhe/KomodiaAhoCorasick



来源:https://stackoverflow.com/questions/20328989/aho-corasick-algorithm

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!