aho-corasick

Algorithmic way to search a list of tuples for a matching substring?

断了今生、忘了曾经 submitted on 2021-01-27 12:43:31
Question: I have a list of tuples, about 100k entries. Each tuple consists of an id and a string. My goal is to list the ids of the tuples whose strings contain a substring from a given list of substrings. My current solution uses a set comprehension, because ids can repeat:

    tuples = [(id1, 'cheese trees'), (id2, 'freezy breeze'), ...]
    vals = ['cheese', 'flees']
    ids = {i[0] for i in tuples if any(val in i[1] for val in vals)}

Output: {id1}

Is there an algorithm that would allow doing that quicker? I'm
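Since this question is filed under the aho-corasick tag, here is a minimal pure-Python sketch of how an Aho-Corasick automaton could replace the nested `any(...)` scan. The function names `build_automaton` and `contains_any` and the integer ids are made up for illustration and do not come from the question. Each string is scanned once, regardless of how many substrings are in `vals`.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto, failure and output tables (hypothetical helper)."""
    goto = [{}]      # goto[state][char] -> next state; state 0 is the root
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    out = [False]    # out[state] -> True if some pattern ends at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state] = True
    # Breadth-first pass: children of the root keep fail = 0,
    # deeper nodes derive their failure link from their parent's.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] or out[fail[nxt]]
    return goto, fail, out

def contains_any(text, automaton):
    """Return True as soon as any pattern is found in text."""
    goto, fail, out = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return True
    return False

# Illustrative data: integer ids stand in for id1, id2 from the question.
tuples = [(1, 'cheese trees'), (2, 'freezy breeze'), (1, 'more trees')]
vals = ['cheese', 'flees']
automaton = build_automaton(vals)
ids = {i for i, s in tuples if contains_any(s, automaton)}
```

The automaton is built once from `vals` and reused for every tuple, which is where the saving over the per-substring `in` checks would come from when `vals` is large.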

State transition table for aho corasick algorithm

喜你入骨 submitted on 2020-01-01 06:54:34
Question: Please help me understand how the state transition table is constructed for more than one pattern in the Aho-Corasick algorithm. Please give a simple and detailed explanation so that I can understand it. I am following this paper and here is the animation for that. Thanks.

Answer 1: Phase 1, creating the keyword tree: Starting at the root, follow the path labeled by the characters of P_i. If the path ends before P_i, continue it by adding new edges and ... nodes for the remaining characters of P. Store
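Phase 1 as described in the answer can be sketched in Python as follows. This is an illustration only: the pattern set and the names `build_keyword_tree`, `goto` and `ends` are assumptions, and recording the finished pattern at its last node is a guess at where the truncated "Store" step is heading.

```python
def build_keyword_tree(patterns):
    """Phase 1 only: build the keyword tree (trie) for the patterns."""
    goto = [{}]   # goto[state][char] -> child state; state 0 is the root
    ends = {}     # state -> pattern that ends there (assumed bookkeeping)
    for pat in patterns:
        state = 0
        for ch in pat:
            # Follow the existing path while it matches the pattern ...
            if ch not in goto[state]:
                # ... and add a new edge and node once the path runs out.
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        ends[state] = pat
    return goto, ends

goto, ends = build_keyword_tree(["he", "she", "his", "hers"])

# Print one row per state: its outgoing edges and the pattern ending there.
for state, edges in enumerate(goto):
    print(state, edges, ends.get(state, ""))
```

Shared prefixes share nodes: "he", "his" and "hers" all start by following the same `h` edge out of the root.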

Aho-Corasick text matching on whole words?

笑着哭i submitted on 2019-12-24 00:58:20
Question: I'm using Aho-Corasick text matching and wonder if it could be altered to match terms instead of characters. In other words, I want the terms to be the basis of matching rather than the characters. As an example: for the search query "He" and the sentence "Hello world", Aho-Corasick will match "he" to the sentence "hello world" ending at index 2, but I would prefer to have no match. So by "terms" I mean words rather than characters.

Answer 1: One way to do this would be to use Aho-Corasick as usual, then
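The answer is cut off after "then". One common way to finish the idea is to run the matcher unchanged and discard any match that is flanked by a letter or digit. This is an assumption, not necessarily what the answerer wrote. The sketch below shows only that post-filter. `find_all` is a plain `str.find` stand-in for whatever Aho-Corasick implementation produces the match positions, and all names are hypothetical.

```python
def is_whole_word(text, start, end):
    """True if text[start:end] is not flanked by a letter or digit."""
    before_ok = start == 0 or not text[start - 1].isalnum()
    after_ok = end == len(text) or not text[end].isalnum()
    return before_ok and after_ok

def find_all(text, term):
    """Stand-in matcher: yield (start, end) for every occurrence of term."""
    start = text.find(term)
    while start != -1:
        yield start, start + len(term)
        start = text.find(term, start + 1)

def whole_word_matches(text, terms):
    """Return (term, start) pairs for case-insensitive whole-word matches."""
    # Assumes lowercasing does not change string length (true for ASCII).
    lowered = text.lower()
    return [
        (term, start)
        for term in terms
        for start, end in find_all(lowered, term.lower())
        if is_whole_word(lowered, start, end)
    ]
```

With this filter, the query "He" no longer matches inside "Hello world", but still matches the standalone word in "He said hello".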

Using Aho-Corasick, can strings be added after the initial tree is built?

佐手、 submitted on 2019-12-22 01:29:52
Question: I want to search for strings inside a large number of documents. I have a predefined list of strings available that I want to find in each document. Each document contains a header at the beginning followed by the text, and the header contains additional strings I want to search for in the text below the header. For each document, is it possible to add the header strings after creating the initial tree that was made from the main list? Or modify the original data structure to include

Knuth-Morris-Pratt algorithm in Haskell

时光怂恿深爱的人放手 submitted on 2019-12-21 07:38:45
Question: I have trouble understanding this implementation of the Knuth-Morris-Pratt algorithm in Haskell: http://twanvl.nl/blog/haskell/Knuth-Morris-Pratt-in-Haskell In particular, I don't understand the construction of the automaton. I know that it uses the "Tying the Knot" method to construct it, but it isn't clear to me, and I also don't know why it should have the right complexity. Another thing I would like to know is whether you think that this implementation could be easily generalized to

Aho Corasick algorithm

こ雲淡風輕ζ submitted on 2019-12-18 12:37:44
Question: I am not able to understand the algorithm below, which is used for string pattern matching with the Aho-Corasick algorithm.

    Procedure AC(y, n, q0)
    INPUT:
      y  <- array of m bytes representing the text input (SQL Query Statement)
      n  <- integer representing the text length (SQL Query Length)
      q0 <- initial state (first character in pattern)
    State <- q0
    For i = 1 to n do
      While g(State, y[i]) == fail do
        State <- f(State)
      End While
      State <- g(State, y[i])
      If o(State) ≠ ∅ then
        Output i
      Else
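To make the roles of g, f and o concrete, here is a small Python sketch of the same search loop. The three tables are written out by hand for the two patterns "ab" and "b". These patterns, the table contents and the names `goto_table`, `failure`, `output` and `ac_search` are illustrative assumptions and are not part of the quoted procedure. As in the standard formulation, g never fails at the initial state, so the inner loop always terminates.

```python
FAIL = None  # marker for "no goto transition"

# g: goto function as (state, char) -> state, for the patterns "ab" and "b"
goto_table = {(0, 'a'): 1, (1, 'b'): 2, (0, 'b'): 3}
# f: failure function, the state to retry from after a failed goto
failure = {1: 0, 2: 3, 3: 0}
# o: output function, the patterns recognised on entering a state
output = {2: ["ab", "b"], 3: ["b"]}

def g(state, ch):
    """Goto function; the initial state 0 loops to itself instead of failing."""
    nxt = goto_table.get((state, ch), FAIL)
    if nxt is FAIL and state == 0:
        return 0
    return nxt

def ac_search(y):
    """Scan y once and report (position, patterns) for every match end."""
    state = 0                                # State <- q0
    hits = []
    for i, ch in enumerate(y, start=1):      # For i = 1 to n
        while g(state, ch) is FAIL:          # While g(State, y[i]) == fail
            state = failure[state]           #   State <- f(State)
        state = g(state, ch)                 # State <- g(State, y[i])
        if output.get(state):                # If o(State) is not empty
            hits.append((i, output[state]))  #   Output i
    return hits

print(ac_search("cabb"))
```

The reported index i is the 1-based position where a match ends, which follows the `For i = 1 to n` convention of the procedure.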

Updating an Aho-Corasick trie in the face of inserts and deletes

喜你入骨 submitted on 2019-12-13 12:17:20
Question: All the literature and implementations I've found of Aho-Corasick are about building the entire trie beforehand from a set of phrases. However, I'm interested in ways to work with it as a mutable data structure, where it can handle occasional adds and removes without needing to rebuild the whole trie (imagine there are 1 million entries in it). It's OK if the worst case is awful, as long as the average case is close to logarithmic. As far as I can tell, the fail state for each node is another

Aho-Corasick-like algorithm for use in anti-malware code

帅比萌擦擦* submitted on 2019-12-06 03:58:23
Question: Is there an algorithm like Aho-Corasick that can match a set of patterns simultaneously and is suitable for anti-malware comparison? Do all known commercial antivirus products use the Aho-Corasick algorithm? What are the advantages of the Aho-Corasick algorithm over Boyer-Moore?

Answer 1: Boyer-Moore: for searching for one string in another target string. Aho-Corasick: for searching for multiple patterns simultaneously. So the advantage is that Aho-Corasick is optimal if you want to

Faster Aho-Corasick PHP implementation

纵然是瞬间 submitted on 2019-12-04 23:15:46
Question: Is there a working implementation of Aho–Corasick in PHP? There is one Aho-Corasick string matching implementation in PHP mentioned in the Wikipedia article:

    <?php
    /*
      This class performs a multiple pattern matching by using the Aho-Corasick algorithm, which scans text and matches all patterns "at once".
      This class can:
      - find if any of the patterns occurs inside the text
      - find all occurrences of the patterns inside the text
      - substitute all occurrences of the patterns with a specified string (empty as

Using Aho-Corasick, can strings be added after the initial tree is built?

混江龙づ霸主 submitted on 2019-12-04 19:32:39
I want to search for strings inside a large number of documents. I have a predefined list of strings available that I want to find in each document. Each document contains a header at the beginning followed by the text and in the header are additional strings I want to search for in the text below the header. On each iteration of document, is it possible to add the header strings after creating the initial tree that was made from the main list? Or modify the original data structure to include the new strings? If this is not practical to do, is there an alternative search method that would be