Find occurrences of huge list of phrases in text

傲寒 2021-02-08 05:02

I'm building a backend and trying to solve the following problem.

  • The clients submit text to the backend (around 2000 characters on average)
8 answers
  •  旧时难觅i
    2021-02-08 05:35

    A "Patricia tree" is a good fit for this kind of problem. It is a kind of radix tree in which each branch corresponds to a character choice. To check whether "the dog" is in the tree, you start at the root, take the "t" branch, then the "h" branch, and so on. A Patricia tree speeds this up by collapsing chains of single-child nodes, so a whole run of characters is matched in one step.

    So you run your text through the tree, starting a walk at each position, and collect every tree location (that is, every phrase) that the text hits. This also gives you overlapping matches if you want them.

    The original paper is Donald R. Morrison, "PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric", Journal of the ACM, 15(4):514-534, October 1968. There is some discussion at https://xlinux.nist.gov/dads/HTML/patriciatree.html and there are several implementations on GitHub, though I don't know which are good.
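
The tree walk described in this answer could be sketched roughly as follows. To keep it short, this uses a plain, uncompressed character trie rather than a true Patricia tree (which would merge single-child chains into one edge), so it illustrates the lookup idea but not the space savings. The names `TrieNode`, `build_trie`, and `find_phrases` are made up for this example, and matching is assumed to be exact and case-sensitive.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class TrieNode:
    """One node of an uncompressed character trie (not a real Patricia node)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.phrase: Optional[str] = None  # set when a phrase ends at this node


def build_trie(phrases: List[str]) -> TrieNode:
    """Insert every non-empty phrase into a trie, one character per edge."""
    root = TrieNode()
    for phrase in phrases:
        if not phrase:
            continue  # an empty phrase would match everywhere, so skip it
        node = root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.phrase = phrase
    return root


def find_phrases(text: str, root: TrieNode) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, phrase) for every match, overlaps included."""
    for start in range(len(text)):
        node = root
        pos = start
        # Follow branches for as long as the text keeps matching an edge.
        while pos < len(text) and text[pos] in node.children:
            node = node.children[text[pos]]
            pos += 1
            if node.phrase is not None:
                yield start, node.phrase


if __name__ == "__main__":
    trie = build_trie(["the dog", "dog", "the"])
    for start, phrase in find_phrases("the dog barks", trie):
        print(start, repr(phrase))
```

Each starting position costs at most as many steps as the longest phrase, regardless of how many phrases are stored, which is what makes this approach attractive for a large phrase list and a text of a couple of thousand characters.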
