most efficient way to find partial string matches in large file of strings (python)


Greg's answer is good if you want to match on individual words. If you want to match on substrings, you'll need something a bit more complicated, like a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree). Once constructed, a suffix tree can efficiently answer queries for arbitrary substrings, so in your example it could match "Ice_Hockey" when someone searches for "hock".
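A full suffix tree takes some effort to implement, but the same idea can be sketched with a sorted list of suffixes and binary search. The titles below are made-up sample data, and this naive index stores every suffix explicitly, so it trades memory for simplicity:

```python
import bisect

# Hypothetical sample titles; in practice these come from your data file.
titles = ["Ice_Hockey", "Field_Hockey", "Ice_Cream"]

# Naive suffix index: every suffix of every lowercased title, paired with
# the title it came from, kept sorted so it can be binary-searched.
suffixes = sorted(
    (title.lower()[i:], title)
    for title in titles
    for i in range(len(title))
)

def substring_search(query):
    """Return every title containing `query` as a substring."""
    q = query.lower()
    # Find the first suffix >= the query, then collect titles while
    # the suffixes still start with the query.
    i = bisect.bisect_left(suffixes, (q,))
    matches = set()
    while i < len(suffixes) and suffixes[i][0].startswith(q):
        matches.add(suffixes[i][1])
        i += 1
    return matches

print(substring_search("hock"))  # {'Ice_Hockey', 'Field_Hockey'}
```

Each lookup is a binary search plus a scan over the matches, but storing every suffix of a title of length m costs on the order of m² characters, which is exactly the waste a real suffix tree avoids.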

If you've got a fixed data set and variable queries, the usual technique is to reorganise the data set into something that can be searched more easily. At an abstract level, you could break each article title into individual lowercase words and add each of them to a Python dictionary. Then, whenever you get a query, convert the query word to lower case and look it up in the dictionary. If each dictionary value is a list of titles, you can easily find all the titles that match a given query word.
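A minimal sketch of that inverted-index idea, assuming underscore-separated Wikipedia-style titles like the ones in the question:

```python
from collections import defaultdict

# Hypothetical sample titles.
titles = ["Ice_Hockey", "Field_Hockey", "Ice_Cream"]

# Inverted index: lowercase word -> list of titles containing that word.
index = defaultdict(list)
for title in titles:
    for word in title.lower().split("_"):
        index[word].append(title)

def word_search(query):
    """Return all titles that contain the query as a whole word."""
    return index.get(query.lower(), [])

print(word_search("hockey"))  # ['Ice_Hockey', 'Field_Hockey']
print(word_search("hock"))    # [] (whole words only; see the suffix index above)
```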

This works for straightforward words, but you will have to consider whether you also want to match related word forms, such as finding "smoking" when the query is "smoke".
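One common way to handle that is stemming: reduce both the indexed words and the query word to a common stem so related forms land on the same dictionary key. For example, using the Porter stemmer from the third-party NLTK library:

```python
from nltk.stem import PorterStemmer  # third-party: pip install nltk

stemmer = PorterStemmer()

# "smoking" and "smoke" reduce to the same stem, so indexing and querying
# by stem lets the query "smoke" find titles containing "smoking".
print(stemmer.stem("smoking"))  # smoke
print(stemmer.stem("smoke"))    # smoke
```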

I'd suggest you put your data into an SQLite database and use the SQL LIKE operator for your searches.
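A minimal sketch with the standard-library sqlite3 module (sample titles are made up). Note that a LIKE pattern with a leading wildcard can't use an ordinary index, so each query scans the table, though for modest data sets that is often fast enough:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("CREATE TABLE articles (title TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?)",
    [("Ice_Hockey",), ("Field_Hockey",), ("Ice_Cream",)],
)

# %...% makes LIKE do a substring match (case-insensitive for ASCII in SQLite).
query = "hock"
rows = conn.execute(
    "SELECT title FROM articles WHERE title LIKE ?", ("%" + query + "%",)
)
print([row[0] for row in rows])  # ['Ice_Hockey', 'Field_Hockey']
```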
