Longest Prefix Matches for URLs

醉话见心 2020-12-13 01:21

I need information about any standard Python package that can be used for "longest prefix match" on URLs. I have gone through the two standard packages http://packages.py

4 answers
  •  感动是毒
    2020-12-13 02:03

    If you are willing to trade RAM for speed, then SuffixTree might be useful. It has nice algorithmic properties; for example, it solves the longest common substring problem in linear time.

    If you always search for a prefix rather than an arbitrary substring, you could prepend a unique marker character to each key while populating SubstringDict():

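    from SuffixTree import SubstringDict

    # URLS is assumed to be an iterable of URL strings defined elsewhere.
    substr_dict = SubstringDict()
    for url in URLS:  # urls must be ascii (valid urls are)
        assert '\n' not in url
        # NOTE: assume that '\n' can't be in a url, so it marks the start of a key
        substr_dict['\n' + url] = url

    def longest_match(url_prefix, _substr_dict=substr_dict):
        # A substring lookup for '\n' + prefix only hits keys that start with prefix.
        matches = _substr_dict['\n' + url_prefix]
        return max(matches, key=len) if matches else ''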
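
    To see why the '\n' marker turns a substring query into a prefix query, here is a minimal pure-Python sketch that does not use SuffixTree at all. The helper name `matches_as_prefix` is made up for illustration; it only demonstrates the equivalence and says nothing about how SubstringDict works internally.

```python
def matches_as_prefix(url, url_prefix, sentinel='\n'):
    """Substring test on sentinel-marked strings; equivalent to url.startswith(url_prefix)."""
    # The sentinel must not occur in either string, otherwise the trick breaks.
    assert sentinel not in url and sentinel not in url_prefix
    # sentinel + url contains the sentinel only at position 0, so the
    # substring sentinel + url_prefix can only match at the very start.
    return (sentinel + url_prefix) in (sentinel + url)
```

    Because the marker appears exactly once, at the start of every stored key, a match in the middle of a URL (for example `example.com` inside `http://example.com/a`) is ruled out.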
    

    Such usage of SuffixTree seems suboptimal, but on the data I've tried it is 20-150 times faster (excluding SubstringDict()'s construction time) than @StephenPaulger's solution [which is based on .startswith()], and it could be good enough.
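
    For reference, a linear-scan baseline built on .startswith() might look roughly like the sketch below. This is not @StephenPaulger's actual code, which is not reproduced in this answer; the function name `longest_match_linear` and the sample URLs are assumptions made for illustration. It returns the same kind of result as `longest_match()` above, but checks every URL on each query.

```python
# Hypothetical sample data, for illustration only.
SAMPLE_URLS = [
    "http://example.com/",
    "http://example.com/docs/",
    "http://example.com/docs/api/index.html",
    "http://example.org/",
]

def longest_match_linear(url_prefix, urls=SAMPLE_URLS):
    """Return the longest URL in `urls` that starts with `url_prefix`, or ''."""
    # O(len(urls)) work per query: every URL is tested with str.startswith().
    matches = [url for url in urls if url.startswith(url_prefix)]
    return max(matches, key=len) if matches else ''
```

    A scan like this needs no extra memory or third-party package, which makes it a reasonable starting point when the URL list is small or queries are rare.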

    To install SuffixTree, run:

    pip install SuffixTree -f https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees
    
