Question
I need a faster way to store and access around 3 GB of k:v pairs, where k is a string or an integer and v is an np.array() that can be of different shapes.
Is there any object that is faster than the standard Python dict for storing and accessing such a table? For example, a pandas.DataFrame?
As far as I understand, the Python dict is a fairly fast hash table implementation. Is there anything better than that for my specific case?
Answer 1:
No, there is nothing faster than a dictionary for this task, because the average-case complexity of its indexing and even its membership checking is approximately O(1).
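As a rough illustration of the O(1) claim, the following sketch times a single lookup in dictionaries of very different sizes. The helper name lookup_time_ns and the chosen sizes are made up for this example, and the measured time includes the overhead of the lambda call, so only the comparison between sizes is meaningful, not the absolute numbers.

```python
import timeit


def lookup_time_ns(size, repeats=200_000):
    """Average time in nanoseconds of one lookup in a dict holding `size` keys."""
    table = {"key{}".format(i): i for i in range(size)}
    probe = "key{}".format(size // 2)  # a key that is known to be present
    seconds = timeit.timeit(lambda: table[probe], number=repeats)
    return seconds / repeats * 1e9


if __name__ == "__main__":
    # Hypothetical sizes; the per-lookup time should stay in the same range.
    for size in (1_000, 100_000, 1_000_000):
        print(size, "keys:", round(lookup_time_ns(size), 1), "ns per lookup")
```

If the per-lookup time stays roughly flat as the size grows, the bottleneck is probably elsewhere, for example in creating or copying the arrays.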
Once you have saved your items in a dictionary, you can access them in constant time, which means it's unlikely that your performance problem has anything to do with dictionary indexing. However, you might be able to make this process slightly faster by changing your objects and their types in ways that enable some optimizations in the under-the-hood operations. For example, if your strings (keys) are not very large you can intern them, which means a single cached copy of each string is kept in memory rather than a separate object being created each time. If the keys in a dictionary are interned and the lookup key is interned, the key comparison (after hashing) can be done by a pointer compare instead of a string compare. That reduces the access time.
Python provides an intern() function in the sys module that you can use for this purpose.
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup...
Here is an example:
In [48]: import sys
In [49]: d = {'mystr{}'.format(i): i for i in range(30)}
In [50]: %timeit d['mystr25']
10000000 loops, best of 3: 46.9 ns per loop
In [51]: d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}
In [52]: %timeit d['mystr25']
10000000 loops, best of 3: 38.8 ns per loop
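For readers without IPython, here is a minimal standalone sketch of the same idea using the timeit module. It also shows what interning does at the object level: two equal strings built at runtime become the same object after sys.intern. The helper time_lookup_ns is a name invented for this sketch, and the timings will differ between machines and Python versions.

```python
import sys
import timeit

# Two strings built at runtime have equal contents...
a = "mystr{}".format(25)
b = "mystr{}".format(25)
equal_contents = a == b

# ...and sys.intern maps both to one shared object, so an identity
# (pointer) check is enough to tell that they are the same key.
ia = sys.intern(a)
ib = sys.intern(b)
same_object_after_intern = ia is ib

plain = {"mystr{}".format(i): i for i in range(30)}
interned = {sys.intern("mystr{}".format(i)): i for i in range(30)}


def time_lookup_ns(table, key, repeats=500_000):
    """Average time in nanoseconds of table[key], including lambda overhead."""
    return timeit.timeit(lambda: table[key], number=repeats) / repeats * 1e9


if __name__ == "__main__":
    print("same object after intern:", same_object_after_intern)
    print("plain dict:   ", round(time_lookup_ns(plain, ia), 1), "ns")
    print("interned dict:", round(time_lookup_ns(interned, ia), 1), "ns")
```

Any gain should be expected to be small, in line with the few nanoseconds seen in the session above.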
Answer 2:
No, I don't think there is anything faster than dict. The average-case time complexity of its item lookup is O(1).
-------------------------------------------------------
Operation     | Average Case | Amortized Worst Case |
-------------------------------------------------------
Copy          | O(n)         | O(n)                 |
Get Item      | O(1)         | O(n)                 |
Set Item      | O(1)         | O(n)                 |
Delete Item   | O(1)         | O(n)                 |
Iteration     | O(n)         | O(n)                 |
-------------------------------------------------------
Source: https://wiki.python.org/moin/TimeComplexity
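The O(n) worst case for Get Item in the table comes from hash collisions. The sketch below makes that visible by counting equality checks during a single lookup, once with well-spread hashes and once with every key forced to the same hash. CountingKey and comparisons_for_lookup are hypothetical names written only for this illustration. Real string or integer keys do not normally collide like this.

```python
class CountingKey:
    """Dict key that counts equality checks and lets the caller pick its hash."""

    comparisons = 0  # shared counter, reset before each measured lookup

    def __init__(self, value, hash_value):
        self.value = value
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return isinstance(other, CountingKey) and self.value == other.value


def comparisons_for_lookup(n, colliding):
    """Build a dict of n keys and count __eq__ calls for one lookup."""
    table = {CountingKey(i, 0 if colliding else i): i for i in range(n)}
    # Use a fresh, equal key so the lookup cannot succeed by identity alone.
    probe = CountingKey(n - 1, 0 if colliding else n - 1)
    CountingKey.comparisons = 0
    found = table[probe]
    assert found == n - 1
    return CountingKey.comparisons


N = 200
spread = comparisons_for_lookup(N, colliding=False)
collided = comparisons_for_lookup(N, colliding=True)

if __name__ == "__main__":
    print("equality checks with distinct hashes:", spread)
    print("equality checks with identical hashes:", collided)
```

With distinct hashes the lookup needs only a handful of comparisons, while with identical hashes the count grows with the number of keys.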
Answer 3:
Since your keys are strings, you could consider storing them in a data structure like a trie. Storing to and retrieving from a trie takes O(N), where N is the maximum length of a key. The same applies to the hash calculation, which has to process the whole key to compute the hash that is then used to find and store the entry in the hash table. We often don't take this hashing time into account.
You could give a trie a shot. It should give almost equal performance, and may be a little bit faster if the hash value is computed in a costlier way, for example
HASH[i] = (HASH[i-1] + key[i-1]*256^i % BUCKET_SIZE ) % BUCKET_SIZE
or something similar, where the 256^i factor is needed to reduce collisions.
You can try storing them in a trie and see how it performs.
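To try this out, a minimal pure-Python trie could look like the sketch below. It is an illustration, not a tuned implementation: the Trie class and its methods are written for this example, values are plain lists standing in for arrays, and each node keeps its children in a dict.

```python
class Trie:
    """Minimal trie mapping string keys to arbitrary values."""

    _MISSING = object()  # sentinel: marks nodes that hold no value

    def __init__(self):
        self._children = {}
        self._value = Trie._MISSING

    def __setitem__(self, key, value):
        node = self
        for ch in key:
            nxt = node._children.get(ch)
            if nxt is None:
                nxt = node._children[ch] = Trie()
            node = nxt
        node._value = value

    def __getitem__(self, key):
        node = self
        for ch in key:
            node = node._children.get(ch)
            if node is None:
                raise KeyError(key)
        if node._value is Trie._MISSING:
            raise KeyError(key)  # key is only a prefix of stored keys
        return node._value

    def __contains__(self, key):
        try:
            self[key]
        except KeyError:
            return False
        return True


trie = Trie()
trie["mystr1"] = [1, 2, 3]
trie["mystr25"] = [[1, 2], [3, 4]]
trie["my"] = []
```

Note that this version performs one dict lookup per character of the key, so in CPython it is likely to be slower than a single dict lookup. Timing it against a plain dict on your own keys is the only way to know.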
Source: https://stackoverflow.com/questions/40694470/is-there-anything-faster-than-dict