Efficient string matching algorithm

Happy的楠姐 2020-12-16 07:13

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

Here are my requirements:

14 answers
  •  夕颜 (original poster)
    2020-12-16 08:03

    If you're looking to roll your own, I would store the entries in a tree structure (a trie, with one character per node). See my answer to another SO question about spell checkers for what I mean.

    Rather than tokenizing each entry on "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one pass.

    The only differences between this and a regular spell-checking tree are:

    1. The matching needs to be done in reverse
    2. You have to take into account the wildcards

    To address point #2, you would simply check whether the traversal for a test string ends on a "*" character.

    A quick example:

    Entries:

    *.fark.com
    www.cnn.com
    

    Tree:

    m -> o -> c -> . -> k -> r -> a -> f -> . -> *
                    \
                     -> n -> n -> c -> . -> w -> w -> w
    

    Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.

    Checking www.cern.com would fail on the second "n" of n,n,c,...

    Checking dev.www.cnn.com would also fail: the tree runs out at the last "w" while the test string still has characters left, so the traversal ends on a character other than "*".
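
As a rough illustration of the approach described above, here is a minimal Python sketch of a reversed character tree. It rests on a few assumptions that are not in the answer itself: a wildcard only ever appears as the leftmost character of an entry, a wildcard must cover at least one character, and the names `build_trie`, `matches`, `WILDCARD` and `END` are made up for this example.

```python
WILDCARD = "*"
END = ""  # hypothetical end-of-entry marker; never equal to a single character


def build_trie(entries):
    """Store each entry back to front, one character per nested dict level."""
    root = {}
    for entry in entries:
        node = root
        for ch in reversed(entry):
            node = node.setdefault(ch, {})
        node[END] = True  # mark that a complete entry stops here
    return root


def matches(trie, text):
    """Walk the tree along the reversed text, as in the examples above."""
    node = trie
    for ch in reversed(text):
        if WILDCARD in node:
            # Assumption: a wildcard absorbs everything that is left.
            return True
        if ch not in node:
            return False  # no branch for this character
        node = node[ch]
    # Text exhausted: only an exact entry counts as a match.
    return END in node


trie = build_trie(["*.fark.com", "www.cnn.com"])
for candidate in ["www.blog.fark.com", "www.cern.com", "dev.www.cnn.com"]:
    print(candidate, matches(trie, candidate))
```

Each lookup touches at most one node per character of the test string, regardless of how many entries are stored. Under the assumptions above, the bare domain fark.com does not match *.fark.com; whether it should depends on requirements not shown here.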
