Find occurrences of huge list of phrases in text


傲寒 2021-02-08 05:02

I'm building a backend and trying to crunch the following problem.

The clients submit text to the backend (around 2000 characters on average)

8 answers

旧时难觅i (original poster)

2021-02-08 05:35

A "Patricia tree" is a good solution for this kind of problem. It's a kind of radix tree in which each branch corresponds to one of the possible next characters. So to find out whether "the dog" is in the tree, you start at the root, take the "t" branch, then the "h" branch, and so on. A Patricia tree does the same thing but collapses chains of single-child nodes into one edge, which keeps lookups fast.

So you run your text through the tree and collect every tree location (= phrase) that it hits. This will even get you overlapping matches if you want.
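To make the idea concrete, here is a minimal sketch in Python. It is not a real Patricia tree: it uses a plain character trie built from nested dicts, with no path compression, so it only illustrates the walk described above. The names `build_trie`, `find_phrases` and `END` are made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical sentinel key marking "a phrase ends here". It is longer than
# one character, so it can never collide with a single-character branch key.
END = "$end"


def build_trie(phrases: List[str]) -> Dict:
    """Build a plain character trie (no path compression) from the phrases."""
    root: Dict = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[END] = phrase
    return root


def find_phrases(text: str, root: Dict) -> List[Tuple[int, str]]:
    """Return (start_index, phrase) for every match, overlapping ones included."""
    hits: List[Tuple[int, str]] = []
    for start in range(len(text)):
        node = root
        pos = start
        # Follow branches for as long as the text keeps matching the trie.
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
            if END in node:
                hits.append((start, node[END]))
    return hits


trie = build_trie(["the dog", "dog", "the", "og"])
matches = find_phrases("see the dog run", trie)
```

Each start position walks at most as far as the longest phrase, so the cost depends on the text length and the longest phrase, not on how many phrases are stored. A real Patricia implementation would additionally merge single-child chains to save memory and pointer hops.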

The main article about them is Donald R. Morrison, "PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric", Journal of the ACM, 15(4):514-534, October 1968. There's some discussion at https://xlinux.nist.gov/dads/HTML/patriciatree.html and there are several implementations on GitHub, though I don't know which are good.
