Question
In Python 2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. A mock-up looks like this:
class PersistentDict(object):
    def __setitem__(self, key, value):
        bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
        self._store_to_bucket(bucket_index, key, value)

    def __getitem__(self, key):
        bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
        return self._fetch_from_bucket(bucket_index)[key]
In Python 3, hash() uses a random or fixed salt, which makes it unusable or suboptimal for this purpose [1]. Apparently, it is not possible to use a fixed salt for specific invocations only. So, I need an alternative:
- Must be stable across interpreter invocations
- May require parameters supplied at execution time, e.g. setting a salt in the call
- Must support arbitrary objects (anything supported by dict/set)
I've already tried using hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh) which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash() works with almost everything.
[1] Using hash() to identify buckets is problematic either way:
- It is not reliable across interpreter invocations if salts are random
- It prevents applications from using the random salting feature if salts are fixed
- It is unusable if two PersistentDicts were created with different salts
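The salting behaviour described in the footnote can be observed directly. The following is a minimal sketch, assuming CPython 3 and its documented PYTHONHASHSEED environment variable; the helper name hash_in_fresh_interpreter is hypothetical and only serves the illustration. It evaluates hash() in freshly started interpreters with different seeds.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(value, seed):
    """Return hash(value) as computed by a new interpreter started with
    PYTHONHASHSEED=seed. `value` must have a repr() that evaluates back
    to itself (e.g. str or int)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "print(hash({!r}))".format(value)
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return int(out.decode().strip())


if __name__ == "__main__":
    # Same seed: the str hash is reproducible across interpreter starts.
    print(hash_in_fresh_interpreter("foobar", 1))
    print(hash_in_fresh_interpreter("foobar", 1))
    # Different seed: the str hash changes, so bucket indices would too.
    print(hash_in_fresh_interpreter("foobar", 2))
    # Small ints are not salted and hash to themselves in CPython.
    print(hash_in_fresh_interpreter(12345, 2))
```

With a fixed seed the values are reproducible, but every process that reads the on-disk buckets would have to run with that same seed, which is the trade-off the footnote describes.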
Answer 1:
I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:
import zlib
from datetime import datetime as datetime_type  # the datetime.datetime class named in the docstring


def hashkey(obj, salt=0):
    """
    Create a key suitable for use in hashmaps

    :param obj: object for which to create a key
    :type obj: str, bytes, :py:class:`datetime.datetime`, object
    :param salt: an optional salt to add to the key value
    :type salt: int
    :return: numeric key to `obj`
    :rtype: int
    """
    if obj is None:
        return 0
    if isinstance(obj, str):
        # adler32 takes the salt as its starting checksum value
        return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif isinstance(obj, bytes):
        return zlib.adler32(obj, salt) & 0xffffffff
    elif isinstance(obj, datetime_type):
        return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
    # everything else falls back to the built-in hash (the salt is not applied here)
    return hash(obj) & 0xffffffff
With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead, with 0.8 usec for bytes and 0.7 usec for str.
Overhead is roughly as follows:
- 0.1 usec for the function call (hash(obj) vs def pyhash(obj): return hash(obj))
- 0.2 usec to 0.5 usec for selecting the hash function via isinstance
- 0.75 usec for zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~0.75 usec (adler and crc are +/- 4 usec)
- 0.15 usec for obj.encode() of str objects ("foobar")
- 1.5 usec for str(obj).encode() of datetime.datetime objects
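Figures like the ones above can be reproduced with the standard timeit module. The following is a rough sketch rather than the original benchmark: the statement list and the helper names usec_per_call and measure are assumptions, and absolute numbers will differ by machine and Python version.

```python
import timeit

# Names bound once per timing run; pyhash mirrors the wrapper from the list above.
SETUP = """
import zlib
s = 'foobar'
b = b'foobar'
o = object()
def pyhash(obj):
    return hash(obj)
"""

STATEMENTS = [
    "hash(o)",             # plain built-in hash
    "pyhash(o)",           # adds the cost of one Python-level function call
    "isinstance(o, str)",  # one type check, as used for dispatching
    "zlib.adler32(b)",
    "zlib.crc32(b)",
    "s.encode()",
]


def usec_per_call(stmt, number=200000):
    """Best-of-3 average time per execution of stmt, in microseconds."""
    best = min(timeit.repeat(stmt, setup=SETUP, repeat=3, number=number))
    return best / number * 1e6


def measure():
    """Time every statement and return {statement: usec per call}."""
    return {stmt: usec_per_call(stmt) for stmt in STATEMENTS}


if __name__ == "__main__":
    for stmt, usec in measure().items():
        print("{:<20} {:.3f} usec".format(stmt, usec))
```

Taking the minimum over several repeats reduces noise from other processes; the differences between rows matter more than the absolute values.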
Most of the optimization comes from the ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:
def hashkey_c(obj, salt=0):
    if obj.__class__ in hashkey_c.types:
        if obj is None:
            return 0
        if obj.__class__ is str:
            return zlib.adler32(obj.encode(), salt) & 0xffffffff
        elif obj.__class__ is bytes:
            return zlib.adler32(obj, salt) & 0xffffffff
        elif obj.__class__ is datetime_type:
            return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
    return hash(obj) & 0xffffffff

hashkey_c.types = {str, bytes, datetime_type, type(None)}
Total time: ~0.7 usec for str and bytes, abysmal for datetime, and 0.35 usec for objects, ints, etc. Using a dict to map each type to its hash function performs comparably, provided the membership check is made against a separate, explicit set of the dict's keys (i.e. the types) rather than against the dict itself (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).
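The dict-based variant mentioned above is not shown in the answer. A minimal sketch of how it might look follows, assuming the same per-type rules as hashkey_c; the names hashkey_d, _DISPATCH, _DISPATCH_TYPES and the _key_* helpers are hypothetical.

```python
import zlib
from datetime import datetime


def _key_none(obj, salt):
    return 0


def _key_str(obj, salt):
    return zlib.adler32(obj.encode(), salt) & 0xffffffff


def _key_bytes(obj, salt):
    return zlib.adler32(obj, salt) & 0xffffffff


def _key_datetime(obj, salt):
    return zlib.adler32(str(obj).encode(), salt) & 0xffffffff


# Maps an exact type to the function that computes its key.
_DISPATCH = {
    type(None): _key_none,
    str: _key_str,
    bytes: _key_bytes,
    datetime: _key_datetime,
}
# Separate, explicit set of the dict's keys, used only for the membership test.
_DISPATCH_TYPES = frozenset(_DISPATCH)


def hashkey_d(obj, salt=0):
    cls = obj.__class__
    if cls in _DISPATCH_TYPES:
        return _DISPATCH[cls](obj, salt)
    # Everything else falls back to the built-in hash, as in hashkey_c.
    return hash(obj) & 0xffffffff
```

As with hashkey_c, the lookup is by exact class, so subclasses of str or datetime take the hash() fallback.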
Some additional notes:
- hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
- It does not work properly for immutable containers (which define __hash__) containing a salted type, e.g. (1, 2, 'three')
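One way to address the container limitation is to recurse into the container instead of passing it to hash(). This is not part of the answer above; it is a sketch under the assumption that only tuple and frozenset need handling, and the name stable_key and the scheme for combining element keys are illustrative choices.

```python
import zlib


def stable_key(obj, salt=0):
    """Variant of hashkey that recurses into tuples and frozensets."""
    if obj is None:
        return 0
    cls = obj.__class__
    if cls is str:
        return zlib.adler32(obj.encode(), salt) & 0xffffffff
    if cls is bytes:
        return zlib.adler32(obj, salt) & 0xffffffff
    if cls is tuple or cls is frozenset:
        keys = [stable_key(item, salt) for item in obj]
        if cls is frozenset:
            # Iteration order of a set is arbitrary, so sort the element keys.
            keys.sort()
        combined = salt & 0xffffffff
        for key in keys:
            # Chain the checksum: each element key is fed in as 4 bytes,
            # with the running value as the starting checksum.
            combined = zlib.adler32(key.to_bytes(4, "little"), combined)
        return combined & 0xffffffff
    # Still unstable for objects that rely on the default __hash__.
    return hash(obj) & 0xffffffff
```

With this, (1, 2, 'three') no longer depends on the interpreter's string salt, because the str element is keyed via adler32 rather than hash().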
Source: https://stackoverflow.com/questions/38009699/alternative-to-python-hash-function-for-arbitrary-objects