Optimizing Python Dictionary Lookup Speeds by Shortening Key Size?


Question


I'm not clear on what goes on behind the scenes of a dictionary lookup. Does key size factor into the speed of lookup for that key?

The current dictionary keys are 10-20 characters long, alphanumeric.

I need to do hundreds of lookups a minute.

If I replace those with smaller integer IDs of 1 to 4 digits, will I get faster lookup times? This would mean adding another value to each item the dictionary holds, so overall the dictionary would be larger.

I'd also need to change the program to look up the ID and then get the URL associated with it.
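For illustration, here's a hypothetical sketch of the two schemes I'm comparing (the keys and URLs are made up):

```python
# Hypothetical sketch of the two schemes (keys and URLs are made up).

# Current scheme: alphanumeric string keys map directly to URLs.
urls_by_key = {"a1B2c3D4e5": "https://example.com/page1"}
url = urls_by_key["a1B2c3D4e5"]

# Proposed scheme: short numeric IDs, storing the old key as an extra value.
urls_by_id = {1: {"key": "a1B2c3D4e5", "url": "https://example.com/page1"}}
url = urls_by_id[1]["url"]
```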

Am I likely just adding complexity to the program with little benefit?


Answer 1:


Dictionaries are hash tables, so looking up a key consists of the following steps (sketched in code after the list):

  • Hash the key.
  • Reduce the hash to the table size.
  • Index the table with the result.
  • Compare the key stored at that slot with the input key.
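Here's a rough sketch of those steps in Python. This is not CPython's actual implementation (which is in C and uses a perturbed probing scheme); `toy_lookup` and its linear probing are purely illustrative:

```python
def toy_lookup(table, key):
    """Illustrative only: look up `key` in `table`, a list of
    (key, value) pairs, with None marking empty slots."""
    h = hash(key)                  # 1. Hash the key.
    i = h % len(table)             # 2. Reduce the hash to the table size.
    while True:
        slot = table[i]            # 3. Index the table with the result.
        if slot is None:
            raise KeyError(key)    # Empty slot: the key isn't present.
        stored_key, value = slot
        if stored_key == key:      # 4. Compare the stored key with the input key.
            return value
        i = (i + 1) % len(table)   # Collision: probe the next slot (linear probing).
```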

Normally, this is average-case constant time, and you don't care about anything more than that. There are two potential issues, but they don't come up often.


Hashing the key takes linear time in the length of the key. For, e.g., huge strings, this could be a problem. However, if you look at the source code for most of the important types, including [str/unicode](https://hg.python.org/cpython/file/default/Objects/unicodeobject.c), you'll see that they cache the hash the first time. So, unless you're inputting (or randomly creating, or whatever) a bunch of strings to look up once and then throw away, this is unlikely to be an issue in most real-life programs.
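You can see the caching effect yourself with `timeit` (a rough sketch assuming CPython; exact numbers will vary):

```python
import timeit

s = "x" * 1_000_000  # a long string, so the first hash() call is visibly linear-time

first = timeit.timeit(lambda: hash(s), number=1)      # computes and caches the hash
cached = timeit.timeit(lambda: hash(s), number=1000)  # just reads the cached value
print(f"first hash: {first:.6f}s; 1000 cached hashes: {cached:.6f}s")
```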

On top of that, 20 characters is really pretty short; you can probably do millions of such hashes per second, not hundreds.

From a quick test on my computer, hashing 20 random letters takes 973ns, hashing a 4-digit number takes 94ns, and hashing a value I've already hashed takes 77ns. Yes, that's nanoseconds.
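A sketch of how you might reproduce that kind of measurement (numbers will differ by machine and Python version; note the trick of building a fresh string each iteration to defeat the hash cache):

```python
import random
import string
import timeit

key20 = "".join(random.choices(string.ascii_letters, k=20))
key4 = "".join(random.choices(string.digits, k=4))
n = 1_000_000

# These measure the cached path: str caches its hash after the first call.
print("cached 20-char key:", timeit.timeit(lambda: hash(key20), number=n) / n * 1e9, "ns")
print("cached 4-char key: ", timeit.timeit(lambda: hash(key4), number=n) / n * 1e9, "ns")

# To measure actual hashing, build a fresh (uncached) string each iteration;
# the result then also includes the cost of the concatenation itself.
print("fresh 20-char key: ",
      timeit.timeit("hash(k + 'x')", globals={"k": key20[:-1]}, number=n) / n * 1e9, "ns")
```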


Meanwhile, "Index the table with the result" is a bit of a cheat. What happens if two different keys hash to the same index? Then "compare the looked-up key" will fail, and… what happens next? CPython's implementation uses probing for this. The exact algorithm is explained pretty nicely in the source. But you'll notice that given really pathological data, you could end up doing a linear search for every single element. This is never going to come up—unless someone can attack your program by explicitly crafting pathological data, in which case it will definitely come up.

Switching from 20-character strings to 4-digit numbers wouldn't help here either. If I'm crafting keys to DoS your system via dictionary collisions, I don't care what your actual keys look like, just what they hash to.
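To make the collision point concrete: in CPython, small integers hash to themselves, so it's trivial to construct keys that all land on the same initial slot (a sketch; CPython's perturbed probing keeps real chains short unless the attacker controls the full hashes):

```python
# Small ints hash to themselves in CPython, so keys that are congruent
# modulo the table size all map to the same initial slot.
table_size = 8
keys = [n * table_size for n in range(5)]    # 0, 8, 16, 24, 32
print([hash(k) % table_size for k in keys])  # [0, 0, 0, 0, 0] - all collide
# CPython resolves these with extra probes; a crafted stream of such keys
# is the degenerate case described above. Note that 20-char strings vs.
# 4-digit numbers makes no difference to an attacker who targets the hashes.
```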


More generally, premature optimization is the root of all evil. This is sometimes misquoted to overstate the point; Knuth was arguing that the most important thing to do is find the 3% of the cases where optimization is important, not that optimization is always a waste of time. But either way, the point is: if you don't know in advance where your program is too slow (and if you think you know in advance, you're usually wrong…), profile it, and then find the part where you get the most bang for your buck. Optimizing one arbitrary piece of your code is likely to have no measurable effect at all.
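Concretely, a minimal profiling sketch using the standard library (the dictionary and workload here are hypothetical placeholders for your real code):

```python
import cProfile
import pstats

def main():
    # Hypothetical workload: many lookups against ~20-char keys.
    urls = {f"key-{i:016d}": f"https://example.com/{i}" for i in range(10_000)}
    for _ in range(100):
        for key in urls:
            urls[key]

cProfile.run("main()", "lookup.prof")
pstats.Stats("lookup.prof").sort_stats("cumulative").print_stats(5)
```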




Answer 2:


Python dictionaries are implemented as hash maps under the hood. The key length might have some impact on performance if, for example, the hash function's complexity depends on the key length. But in general the performance impact will be negligible.

So I'd say there is little to no benefit for the added complexity.



Source: https://stackoverflow.com/questions/26496226/optimizing-python-dictionary-lookup-speeds-by-shortening-key-size
