I want to know how I could build some kind of index over the keys of a Python dictionary. The dictionary holds approx. 400,000 items, so I am trying to avoid a linear search.
Basically, I am trying to find out whether the userinput is contained in any of the dict keys.
for keys in dict:
    if userinput in keys:
        DoSomething()
        break
That would be an example of what I am trying to do. Is there a way to search more directly, without a loop, or otherwise a more efficient way?
Clarification: The userinput is not exactly what the key will be, e.g. userinput could be log, whereas the key is logfile.
Edit: any list/cache creation, pre-processing or organisation that can be done prior to searching is acceptable. The only thing that needs to be quick is the search for the key.
If you only need to find keys that start with a prefix then you can use a trie. More complex data structures (such as suffix trees or suffix arrays) exist for finding keys that contain a substring anywhere within them, but they take up a lot more space to store, so it's a space-time trade-off.
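As a rough illustration of the trie idea, here is a minimal Python 3 sketch built from nested dicts. The class name `PrefixTrie`, its methods and the sample keys are made up for this example, and it only handles prefix lookups, not substrings anywhere in the key.

```python
class PrefixTrie:
    """Minimal trie: maps a prefix to all stored keys that start with it."""

    _END = object()  # sentinel label marking "a key ends at this node"

    def __init__(self, keys=()):
        self.root = {}
        for key in keys:
            self.insert(key)

    def insert(self, key):
        # Walk/create one nested dict per character of the key.
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[self._END] = key

    def keys_with_prefix(self, prefix):
        # Walk down to the node for the prefix; stop if a character is missing.
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        # Collect every key stored in the subtree below that node.
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            for label, child in current.items():
                if label is self._END:
                    found.append(child)
                else:
                    stack.append(child)
        return found


# Hypothetical sample keys, standing in for the real dictionary's keys.
trie = PrefixTrie(["logfile", "logbook", "login", "stack", "star"])
print(sorted(trie.keys_with_prefix("log")))  # ['logbook', 'logfile', 'login']
```

The lookup cost depends on the length of the prefix plus the number of matches, not on the total number of keys. The price is the memory for one dict per trie node.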
If you only need to find keys that start with a prefix then you can use a binary search. Something like this will do the job:
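import bisect
words = sorted("""
a b c stack stacey stackoverflow stacked star stare x y z
""".split())
n = len(words)
print(n, "words")
print(words)
print()
tests = sorted("""
r s ss st sta stack star stare stop su t
""".split())
for test in tests:
    # bisect_left returns the first index i with words[i] >= test (or n if
    # there is none), so i needs no further adjustment and is never indexed
    # out of range.
    i = bisect.bisect_left(words, test)
    print(test, i)
    # All words starting with the prefix sit next to each other from i on.
    while i < n and words[i].startswith(test):
        print(i, words[i])
        i += 1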
Output:
12 words
['a', 'b', 'c', 'stacey', 'stack', 'stacked', 'stackoverflow', 'star', 'stare',
'x', 'y', 'z']
r 3
s 3
3 stacey
4 stack
5 stacked
6 stackoverflow
7 star
8 stare
ss 3
st 3
3 stacey
4 stack
5 stacked
6 stackoverflow
7 star
8 stare
sta 3
3 stacey
4 stack
5 stacked
6 stackoverflow
7 star
8 stare
stack 4
4 stack
5 stacked
6 stackoverflow
star 7
7 star
8 stare
stare 8
8 stare
stop 9
su 9
t 9
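The same binary search can be stretched to cover substrings rather than just prefixes, at the cost of extra memory: index every suffix of every key, so that "key contains X" becomes "some suffix starts with X". The following Python 3 sketch assumes that all the preprocessing can be done up front, as the question allows. The function names `build_suffix_index` and `keys_containing` and the sample keys are invented for illustration.

```python
import bisect


def build_suffix_index(keys):
    """Return a sorted list of (suffix, key) pairs for every suffix of every key."""
    return sorted((key[i:], key) for key in keys for i in range(len(key)))


def keys_containing(index, needle):
    """Return the set of keys that contain needle, using binary search."""
    matches = set()
    # The 1-tuple (needle,) sorts before any (suffix, key) pair whose suffix
    # starts with needle, so bisect_left lands on the first candidate.
    i = bisect.bisect_left(index, (needle,))
    while i < len(index) and index[i][0].startswith(needle):
        matches.add(index[i][1])
        i += 1
    return matches


# Hypothetical sample keys, standing in for the real dictionary's keys.
index = build_suffix_index(["logfile", "catalog", "stack"])
print(sorted(keys_containing(index, "log")))  # ['catalog', 'logfile']
```

The index holds one entry per character of every key, so for 400,000 keys it will be several times larger than the key list itself. That is the space-time trade-off mentioned in the trie answer.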
No. The only way of searching for a string in dictionary keys is to look in each key. Something like what you've suggested is the only way of doing it with a dictionary.
However, if you have 400,000 records and you want to speed up your search, I'd suggest using an SQLite database. Then you can just say SELECT * FROM TABLE_NAME WHERE COLUMN_NAME LIKE '%userinput%'; (see the documentation for Python's sqlite3 module).
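A minimal sketch of that query from Python 3, using the standard sqlite3 module and an in-memory database. The table name `entries`, its columns, the helper `keys_like` and the sample rows are placeholders chosen for this example.

```python
import sqlite3

# In-memory database with a hypothetical table standing in for the dictionary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("logfile", "a"), ("catalog", "b"), ("stack", "c")],
)


def keys_like(conn, userinput):
    """Return all keys containing userinput, sorted alphabetically."""
    # Bind the pattern as a parameter instead of formatting it into the SQL.
    # Note: % and _ inside userinput still act as LIKE wildcards, and LIKE is
    # case-insensitive for ASCII letters by default in SQLite.
    pattern = "%" + userinput + "%"
    rows = conn.execute(
        "SELECT key FROM entries WHERE key LIKE ? ORDER BY key", (pattern,)
    )
    return [row[0] for row in rows]


print(keys_like(conn, "log"))  # ['catalog', 'logfile']
```

Keep in mind that a pattern with a leading wildcard generally cannot use the index on the column, so SQLite still has to scan every row.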
Another option is to use a generator expression, as these are almost always faster than the equivalent for loops.
filteredKeys = (key for key in myDict.keys() if userInput in key)
for key in filteredKeys:
    doSomething()
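If only the first match matters, as in the question's loop with break, a generator expression can be handed to next() so that the scan stops at the first hit. This is a small Python 3 sketch. The helper name `first_matching_key` and the sample dictionary are invented here, and the search is still linear in the worst case.

```python
def first_matching_key(mapping, userinput):
    """Return the first key containing userinput, or None if no key does."""
    # next() pulls a single item from the generator, so iteration stops at the
    # first match; the second argument is the default when nothing matches.
    return next((key for key in mapping if userinput in key), None)


# Hypothetical sample data.
my_dict = {"logfile": 1, "stack": 2}
match = first_matching_key(my_dict, "log")
if match is not None:
    print("found", match)  # found logfile
```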
EDIT: If, as you say, you don't care about one-time costs, use a database. SQLite should do what you want damn near perfectly.
I did some benchmarks, and to my surprise, the naive algorithm is actually about 1.5 times as fast as a version using a list comprehension and about 4.5 times as fast as a SQLite-driven version (going by the timings posted below). In light of these results, I'd have to go with @Mark Byers and recommend a Trie. I've posted the benchmark below, in case someone wants to give it a go.
import random, string, os
import time
import sqlite3

def buildDict(numElements):
    aDict = {}
    for i in xrange(numElements-10):
        aDict[''.join(random.sample(string.letters, 6))] = 0
    for i in xrange(10):
        aDict['log'+''.join(random.sample(string.letters, 3))] = 0
    return aDict

def naiveLCSearch(aDict, searchString):
    filteredKeys = [key for key in aDict.keys() if searchString in key]
    return filteredKeys

def naiveSearch(aDict, searchString):
    filteredKeys = []
    for key in aDict:
        if searchString in key:
            filteredKeys.append(key)
    return filteredKeys

def insertIntoDB(aDict):
    conn = sqlite3.connect('/tmp/dictdb')
    c = conn.cursor()
    c.execute('DROP TABLE IF EXISTS BLAH')
    c.execute('CREATE TABLE BLAH (KEY TEXT PRIMARY KEY, VALUE TEXT)')
    for key in aDict:
        c.execute('INSERT INTO BLAH VALUES(?,?)',(key, aDict[key]))
    return conn

def dbSearch(conn):
    cursor = conn.cursor()
    cursor.execute("SELECT KEY FROM BLAH WHERE KEY GLOB '*log*'")
    return [record[0] for record in cursor]

if __name__ == '__main__':
    aDict = buildDict(400000)
    conn = insertIntoDB(aDict)
    startTimeNaive = time.time()
    for i in xrange(3):
        naiveResults = naiveSearch(aDict, 'log')
    endTimeNaive = time.time()
    print 'Time taken for 3 iterations of naive search was', (endTimeNaive-startTimeNaive), 'and the average time per run was', (endTimeNaive-startTimeNaive)/3.0
    startTimeNaiveLC = time.time()
    for i in xrange(3):
        naiveLCResults = naiveLCSearch(aDict, 'log')
    endTimeNaiveLC = time.time()
    print 'Time taken for 3 iterations of naive search with list comprehensions was', (endTimeNaiveLC-startTimeNaiveLC), 'and the average time per run was', (endTimeNaiveLC-startTimeNaiveLC)/3.0
    startTimeDB = time.time()
    for i in xrange(3):
        dbResults = dbSearch(conn)
    endTimeDB = time.time()
    print 'Time taken for 3 iterations of DB search was', (endTimeDB-startTimeDB), 'and the average time per run was', (endTimeDB-startTimeDB)/3.0
    os.remove('/tmp/dictdb')
For the record, my results were:
Time taken for 3 iterations of naive search was 0.264658927917 and the average time per run was 0.0882196426392
Time taken for 3 iterations of naive search with list comprehensions was 0.403481960297 and the average time per run was 0.134493986766
Time taken for 3 iterations of DB search was 1.19464492798 and the average time per run was 0.398214975993
All times are in seconds.
You could join all the keys into one long string with a suitable separator character and use the find method of the string. That is pretty fast.
Perhaps this code is helpful to you. The search method returns a list of dictionary values whose keys contain the substring key.
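class DictLookupBySubstr(object):
    def __init__(self, dictionary, separator='\n'):
        # The separator must not occur in any key or in the search string.
        self.dic = dictionary
        self.sep = separator
        self.txt = separator.join(dictionary.keys())+separator
    def search(self, key):
        res = []
        i = self.txt.find(key)
        while i >= 0:
            # Expand the hit to the enclosing key: from just after the
            # previous separator up to the next separator.
            left = self.txt.rfind(self.sep, 0, i) + 1
            right = self.txt.find(self.sep, i)
            dic_key = self.txt[left:right]
            res.append(self.dic[dic_key])
            # Resume after this key so each key is reported at most once.
            i = self.txt.find(key, right+1)
        return res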
dpath can solve this for you easily.
http://github.com/akesterson/dpath-python
$ easy_install dpath
>>> import dpath.util
>>> for (path, value) in dpath.util.search(MY_DICT, "glob/to/start/{}".format(userinput), yielded=True):
...     pass  # (do something with the path and value)
You can pass an eglob ('path//to//something/[0-9a-z]') for advanced searching.
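Perhaps using has_key would solve this too, although has_key only tests for an exact key (and was removed in Python 3 in favour of the in operator), so it does not cover the case where userinput is only part of the key.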
Source: https://stackoverflow.com/questions/5174506/search-of-dictionary-keys-python