Fast algorithm to extract thousands of simple patterns out of large amounts of text

帅比萌擦擦* 提交于 2019-12-01 06:26:52

You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards present, but it works for upto about 100 literals (for a 4 letter alfabet; probably more for natural text). You may want to suppress the default action (ECHO), and only print the line+column numbers of the matches.

[ I assume grep -F does about the same ]

%{
/* C code to be copied verbatim */
#include <stdio.h>
%}

%%

"TTGATTCACCAGCGCGTATTGTC" { printf("@%d: %d:%s\n", yylineno, yycolumn, "OMG! the TTGA pattern again"  ); }


"AGGTATCTGCTTCAATCAGCG" { printf("@%d: %d:%s\n", yylineno, yycolumn, "WTF?!"  ); } 

... 
more lines
...

[bd-fh-su-z]+ {;}

[ \t\r\n]+ {;}

. {;}

%%

int main(void)
{
/* Call the lexer, then quit. */
yylex();
return 0;
}

A script like the one above can be generated form txt input with awk or any other script language.

A slightly smarter implementation than running every regex on every file:

For each regex:
    load regex into a regex engine
    assemble a list of regex engines
For each byte in the file:
    insert byte to every regex engine
      print results if there are matches

But I don't know of any programs that do this already - you'd have to code it yourself. This also implies you have the ram to keep the regex state around, and that you don't have any evil regexes

I'm not sure if you'd blow some regex size limit, but you could just OR them all up together into one giant regex:

((\bBarack\s(Hussein\s)?Obama\b)|(\b(John|J\.)\sBoehner\b)|(etc)|(etc))

If you hit some limit, you could do this with chunks of 100 at a time or however many you can manage

If you need really fast implementation of some specific case, you can implement suffix tree of Aho–Corasick algorithm yourself. But in most cases union of all your regexes into single regex, as recommended earlier, will be not bad too

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!