I need information about any standard Python package which can be used for "longest prefix match" on URLs. I have gone through the two standard packages http://packages.py
If you are willing to trade RAM for speed, SuffixTree might be useful. It has nice algorithmic properties; for example, it allows solving the longest common substring problem in linear time.
If you always search for a prefix rather than an arbitrary substring, you could prepend a unique character to each key while populating SubstringDict():
from SuffixTree import SubstringDict

# URLS is assumed to be an iterable of URL strings defined elsewhere.
substr_dict = SubstringDict()
for url in URLS:  # urls must be ascii (valid urls are)
    assert '\n' not in url
    substr_dict['\n' + url] = url  # NOTE: assume that '\n' can't be in a url

def longest_match(url_prefix, _substr_dict=substr_dict):
    # The '\n' marker limits the substring lookup to keys that start with url_prefix.
    matches = _substr_dict['\n' + url_prefix]
    return max(matches, key=len) if matches else ''
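
To see why the '\n' marker turns a substring lookup into a prefix lookup, here is a small pure-Python check that doesn't need SuffixTree. The sample URLs and the helper name `sentinel_match` are made up for illustration, and the `in` operator merely stands in for the substring query; it says nothing about how SubstringDict works internally.

```python
# Hypothetical sample data, not from the original answer.
URLS = [
    "http://example.com/",
    "http://example.com/a/b",
    "http://other.example.org/http://example.com/",
]

def sentinel_match(url, prefix):
    """Substring test on the marker-prefixed strings (illustrative helper)."""
    assert "\n" not in url and "\n" not in prefix
    return ("\n" + prefix) in ("\n" + url)

prefix = "http://example.com/"

# '\n' occurs only at position 0 of '\n' + url, so a substring hit
# must start there, which makes it equivalent to str.startswith().
for url in URLS:
    assert sentinel_match(url, prefix) == url.startswith(prefix)

# A plain substring test (no marker) also accepts the third URL,
# where the prefix appears in the middle of the string.
plain_hits = [u for u in URLS if prefix in u]
sentinel_hits = [u for u in URLS if sentinel_match(u, prefix)]
```

With this data, `plain_hits` contains all three URLs while `sentinel_hits` contains only the first two.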
Using SuffixTree this way seems suboptimal, but on the data I've tried it is 20-150 times faster (not counting SubstringDict()'s construction time) than @StephenPaulger's solution [which is based on .startswith()], and it could be good enough.
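
For reference, a linear scan built on .startswith() with the same return convention as longest_match() could look roughly like the sketch below. This is not @StephenPaulger's actual code; the function name `longest_match_scan` and the sample list are assumptions made for illustration.

```python
def longest_match_scan(url_prefix, urls):
    """Return the longest URL in `urls` that starts with `url_prefix`, or ''."""
    # Checks every URL on each call, so the cost grows with len(urls).
    matches = [url for url in urls if url.startswith(url_prefix)]
    return max(matches, key=len) if matches else ''

# Hypothetical sample data.
sample_urls = [
    "http://example.com/",
    "http://example.com/a",
    "http://example.com/a/b",
    "http://example.org/",
]

found = longest_match_scan("http://example.com/a", sample_urls)
missing = longest_match_scan("ftp://", sample_urls)
```

Here `found` is "http://example.com/a/b" and `missing` is the empty string. Each call walks the whole list, which is the per-query cost the SubstringDict approach avoids once the dictionary has been built.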
To install SuffixTree, run:
pip install SuffixTree -f https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees