Question
I have a dict
that takes integer keys:
a = {}
a[1] = 100
a[55] = 101
a[127] = 102
I would like to be able to take the nearest neighbour when asking:
a[20] # should return a[1] = 100
a[58] # should return a[55] = 101
a[167] # should return a[127] = 102
Is there a pythonic way of doing this? (I imagine this can be done by looping over the whole dict, but that's probably not the most elegant solution?)
Same question with a double index (integers as well):
b[90, 1] = 100, b[90, 55] = 101, b[90, 127] = 102
b[70, 1] = 40, b[70, 45] = 41, b[70, 107] = 42
I would like to be able to get b[73, 40] = b[70, 45] = 41, i.e. the nearest neighbour in a 2D plane.
Answer 1:
Something similar to this:
class CustomDict(dict):
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            closest_key = min(self.keys(), key=lambda x: abs(x - key))
            return dict.__getitem__(self, closest_key)
Or this:
class CustomDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        else:
            closest_key = min(self.keys(), key=lambda x: abs(x - key))
            return dict.__getitem__(self, closest_key)
Both give this result:
a = CustomDict()
a[1] = 100
a[55] = 101
a[127] = 102
print(a[20])   # prints 100
print(a[58])   # prints 101
print(a[167])  # prints 102
And for the double index version:
class CustomDoubleDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        else:
            closest_key = min(self.keys(),
                              key=lambda c: (c[0] - key[0]) ** 2 + (c[1] - key[1]) ** 2)
            return dict.__getitem__(self, closest_key)
b = CustomDoubleDict()
b[90, 1] = 100
b[90, 55] = 101
b[90, 127] = 102
b[70, 1] = 40
b[70, 45] = 41
b[70, 107] = 42
print(b[73, 40])  # prints 41
print(b[70, 45])  # prints 41
Answer 2:
Update: After benchmarking the two approaches in this answer, the second approach is significantly better, to the point that it should almost be strictly preferred.
The following approach handles n-dimensions identically:
class NearestDict(dict):
    def __init__(self, ndims):
        super(NearestDict, self).__init__()
        self.ndims = ndims

    # Enforce dimensionality
    def __setitem__(self, key, val):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        super(NearestDict, self).__setitem__(key, val)

    @staticmethod
    def __dist(ka, kb):
        assert len(ka) == len(kb)
        return sum((ea - eb) ** 2 for (ea, eb) in zip(ka, kb))

    # Helper method that might be of use
    def nearest_key(self, key):
        if not isinstance(key, tuple): key = (key,)
        nk = min((k for k in self), key=lambda k: NearestDict.__dist(key, k))
        return nk

    def __missing__(self, key):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        return self[self.nearest_key(key)]
Demo:
a = NearestDict(1)
a[1] = 100
a[55] = 101
a[127] = 102
print(a[20])   # 100
print(a[58])   # 101
print(a[167])  # 102
print(a.nearest_key(20))   # (1,)
print(a.nearest_key(58))   # (55,)
print(a.nearest_key(127))  # (127,)
b = NearestDict(2)
b[90, 1] = 100
b[90, 55] = 101
b[90, 127] = 102
b[70, 1] = 40
b[70, 45] = 41
b[70, 107] = 42
print(b[73, 40])                # 41
print(b.nearest_key((73, 40)))  # (70, 45)
Note that if the key exists, the lookup is no slower than a standard dictionary lookup. If the key does not exist, you compute the distance to every existing key. Nothing is cached, although you could tack that on I suppose.
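For instance, the caching mentioned above could be bolted on like this (a minimal 1-D sketch of my own, not part of the original answer): memoize the nearest-key search on misses, and clear the cache on every insert, since any new key can change a neighbour.

```python
class CachedNearestDict(dict):
    """1-D nearest-key dict that memoizes the nearest-key search on misses."""

    def __init__(self):
        super().__init__()
        self._miss_cache = {}  # missing key -> nearest stored key

    def __setitem__(self, key, val):
        # Any insert can change which stored key is nearest, so drop the cache.
        self._miss_cache.clear()
        super().__setitem__(key, val)

    def __missing__(self, key):
        if key not in self._miss_cache:
            self._miss_cache[key] = min(self, key=lambda k: abs(k - key))
        return self[self._miss_cache[key]]
```

Repeated misses on the same key then cost a single dict lookup instead of a full scan over the keys.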
Edit:
Building on the approach suggested by Kasra's answer, the following implements the same class as above using scipy's cKDTree. Note that there is an additional optional argument, regenOnAdd, that allows you to defer (re)building the KDTree until after you have completed (the majority of) your inserts:
from scipy.spatial import cKDTree

class KDDict(dict):
    def __init__(self, ndims, regenOnAdd=False):
        super(KDDict, self).__init__()
        self.ndims = ndims
        self.regenOnAdd = regenOnAdd
        self.__keys = []
        self.__tree = None
        self.__stale = False

    # Enforce dimensionality
    def __setitem__(self, key, val):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        if key not in self:  # avoid duplicating entries when an existing key is overwritten
            self.__keys.append(key)
        self.__stale = True
        if self.regenOnAdd: self.regenTree()
        super(KDDict, self).__setitem__(key, val)

    def regenTree(self):
        self.__tree = cKDTree(self.__keys)
        self.__stale = False

    # Helper method that might be of use
    def nearest_key(self, key):
        if not isinstance(key, tuple): key = (key,)
        if self.__stale: self.regenTree()
        _, idx = self.__tree.query(key, 1)
        return self.__keys[idx]

    def __missing__(self, key):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        return self[self.nearest_key(key)]
Output is the same as the above approach.
Benchmark Results
To understand the performance of the three approaches (NearestDict, KDDict(True) (regen on insert), and KDDict(False) (defer regen)), I briefly benchmarked them.
I ran 3 different tests. The parameters that stayed the same across the tests were:
- Number of test iterations: I did each test 5 times and took the minimum time. (Note that timeit.repeat defaults to 3.)
- Point boundaries: I generated integer key points in the range 0 <= x < 1000.
- Number of lookups: I timed inserts and lookups separately. The three tests below all use 10,000 lookups.
The first test used keys of 4 dimensions and 1,000 insertions.
{'NDIMS': 4, 'NITER': 5, 'NELEMS': 1000, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict       0.125
insert::KDDict(regen)    35.957
insert::KDDict(defer)     0.174
search::NearestDict    2636.965
search::KDDict(regen)    49.965
search::KDDict(defer)    51.880
The second test used keys of 4 dimensions and 100 insertions. I wanted to vary the number of insertions to see how well the two approaches performed as the density of the dictionary varied.
{'NDIMS': 4, 'NITER': 5, 'NELEMS': 100, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict      0.013
insert::KDDict(regen)    0.629
insert::KDDict(defer)    0.018
search::NearestDict    247.920
search::KDDict(regen)   44.523
search::KDDict(defer)   44.718
The third test used 100 insertions (like the second test) but 12 dimensions. I wanted to see how the approaches performed as key dimensionality increased.
{'NDIMS': 12, 'NITER': 5, 'NELEMS': 100, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict      0.013
insert::KDDict(regen)    0.722
insert::KDDict(defer)    0.017
search::NearestDict    405.092
search::KDDict(regen)   49.046
search::KDDict(defer)   50.601
Discussion
KDDict with continuous regeneration (KDDict(True)) is either fractionally faster (in lookup) or considerably slower (in insertion). Because of this, I'm leaving it out of the discussion and focusing on NearestDict and KDDict(False), now referred to simply as KDDict.
The results were surprisingly in favor of KDDict with deferred regeneration.
For insertion, in all cases, KDDict performed slightly worse than NearestDict. This was to be expected because of the additional list append operation.
For search, in all cases, KDDict performed significantly better than NearestDict.
As the sparsity of the dictionary decreased (density increased), NearestDict's performance degraded to a much greater extent than KDDict's. When going from 100 keys to 1,000 keys, NearestDict search time increased by 9.64x while KDDict search time increased by only 0.16x.
As the dimensionality of the keys increased, NearestDict's performance degraded to a greater extent than KDDict's. When going from 4 to 12 dimensions, NearestDict search time increased by 0.64x while KDDict search time increased by only 0.13x.
In light of this, and the relatively equal complexity of the two classes, if you have access to the scipy toolkit, the KDDict approach is strongly recommended.
Answer 3:
Option 1:
Maintain a separate, ordered list of the keys (or use an OrderedDict). Find the nearest key using binary search. This should be O(log n).
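A minimal sketch of Option 1 (the function name and signature are illustrative, not from the answer), using the standard bisect module on a pre-sorted key list:

```python
import bisect

def nearest_value(sorted_keys, table, key):
    """Return table's value for the stored key nearest to 'key'.
    Assumes sorted_keys == sorted(table)."""
    i = bisect.bisect_left(sorted_keys, key)
    if i == 0:                 # below the smallest key
        return table[sorted_keys[0]]
    if i == len(sorted_keys):  # above the largest key
        return table[sorted_keys[-1]]
    before, after = sorted_keys[i - 1], sorted_keys[i]
    return table[after if after - key < key - before else before]
```

Each lookup is then O(log n) once the sorted key list is built.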
Option 2: (If the data is not very large and sparse)
Since you mentioned the dict is static, make a pass through the dict to populate all the missing values once. Maintain the maximum and minimum keys, and override __getitem__ so that keys above the maximum or below the minimum return the correct value. Lookups are then O(1).
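For a small integer key range, Option 2 could be sketched like this (the helper name is mine, not from the answer): one pass fills in every integer key between the minimum and maximum stored keys.

```python
def precompute_nearest(table):
    """Expand a sparse integer-keyed dict into a dense one covering
    [min_key, max_key]; returns (dense_dict, min_key, max_key)."""
    keys = sorted(table)
    lo, hi = keys[0], keys[-1]
    full, j = {}, 0
    for k in range(lo, hi + 1):
        # advance while the next stored key is at least as close to k
        while j + 1 < len(keys) and abs(keys[j + 1] - k) <= abs(keys[j] - k):
            j += 1
        full[k] = table[keys[j]]
    return full, lo, hi
```

A __getitem__ override would then clamp any key below lo to lo and above hi to hi, giving O(1) lookups at the cost of O(max_key - min_key) memory.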
Option 3:
Just loop over all the keys every time; it will be O(n). Try it in your application first: you may find that the simple solution is fast enough anyway.
Answer 4:
What about using min with a proper key function (here the Manhattan distance):
>>> b = {(90, 55): 101, (90, 127): 102, (90, 1): 100}
>>> def nearest(x, y):
...     m = min(b, key=lambda v: abs(v[0] - x) + abs(v[1] - y))
...     return b[m]
...
>>> nearest(40,100)
102
>>> nearest(90,100)
102
>>> b
{(90, 55): 101, (90, 127): 102, (90, 1): 100}
>>> nearest(90,10)
100
The preceding is the pythonic answer I would suggest, but if you are looking for a fast way you can use scipy.spatial.KDTree:
class scipy.spatial.KDTree(data, leafsize=10)
kd-tree for quick nearest-neighbor lookup
This class provides an index into a set of k-dimensional points which can be used to rapidly look up the nearest neighbors of any point.
Also have a look at http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html#scipy-spatial-ckdtree
and http://docs.scipy.org/doc/scipy/reference/spatial.distance.html#module-scipy.spatial.distance
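Using cKDTree directly on the question's 2-D data might look like this (a sketch assuming scipy is installed; the tree must be rebuilt whenever the set of keys changes):

```python
from scipy.spatial import cKDTree

points = [(90, 1), (90, 55), (90, 127), (70, 1), (70, 45), (70, 107)]
values = [100, 101, 102, 40, 41, 42]
tree = cKDTree(points)  # build once; rebuild after inserting new points

dist, idx = tree.query((73, 40))  # nearest neighbour of (73, 40)
print(points[idx], values[idx])   # -> (70, 45) 41
```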
Answer 5:
Untested O(log n) example for the one-dimensional case:
import collections.abc
import bisect

class ProximityDict(collections.abc.MutableMapping):
    def __init__(self, *args, **kwargs):
        self.store = dict()
        self.ordkeys = []
        self.update(dict(*args, **kwargs))

    def __getitem__(self, key):
        try:
            return self.store[key]
        except KeyError:
            cand = bisect.bisect_left(self.ordkeys, key)
            if cand == 0:
                return self.store[self.ordkeys[0]]
            if cand == len(self.ordkeys):  # key is beyond the largest stored key
                return self.store[self.ordkeys[-1]]
            return self.store[
                min(self.ordkeys[cand], self.ordkeys[cand - 1],
                    key=lambda k: abs(k - key))
            ]

    def __setitem__(self, key, value):
        if key not in self.store:
            bisect.insort_left(self.ordkeys, key)
        self.store[key] = value

    def __delitem__(self, key):
        del self.store[key]
        del self.ordkeys[bisect.bisect_left(self.ordkeys, key)]

    def __iter__(self):
        return iter(self.store)

    def __len__(self):
        return len(self.store)
The two-dimensional case is significantly more annoying (you'd need to store keys ordered in a quadtree), but is possible in a similar way.
I did not code deletion to also have 'proximity' behaviour, but you could do that as well.
Answer 6:
Here is one pythonic solution that relies mainly on map/filter operations:
class NeirestSearchDictionnary1D(dict):
    """ An extended dictionary that returns the value whose key is nearest
    to the requested key. As the key distance is only defined for simple
    numerical values, trying to add other keys will raise an error. """

    def __init__(self):
        """ Constructor of the dictionary.
        It only allows initializing an empty dict. """
        dict.__init__(self)

    def keyDistance(self, key1, key2):
        """ Returns the distance between 2 dict keys. """
        return abs(key1 - key2)

    def __setitem__(self, key, value):
        """ Override the addition of a couple in the dict. """
        # type checking
        if not isinstance(key, (int, float)):
            raise TypeError("The key of a " + type(self).__name__
                            + " must be a simple numerical value")
        else:
            dict.__setitem__(self, key, value)

    def __getitem__(self, key):
        """ Override the getting item operation. """
        # compute the minimal distance
        minimalDistance = min(map(lambda x: self.keyDistance(key, x), self.keys()))
        # get the list of keys that minimize the distance
        resultSetKeys = filter(lambda x: self.keyDistance(key, x) <= minimalDistance, self.keys())
        # return the values bound to the keys minimizing the distance
        return list(map(lambda x: dict.__getitem__(self, x), resultSetKeys))
if __name__ == "__main__":
    dic = NeirestSearchDictionnary1D()
    dic[1] = 100
    dic[55] = 101
    dic[57] = 102
    dic[127] = 103
    print("the entire dict:", dic)
    print("dic of '20'", dic[20])
    print("dic of '56'", dic[56])
Obviously, you can extend this to 2D with a little work.
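One hedged sketch of that 2-D extension (the class name and the use of __missing__ are my own choices, not from the original answer): keys become (x, y) tuples and the key distance becomes squared Euclidean distance. Note that, unlike the 1-D class above, exact hits here return the bare value while misses return a list of nearest values:

```python
class NearestSearchDictionary2D(dict):
    """2-D variant: keys are (x, y) tuples; lookups that miss return the
    values of all stored keys at minimal squared Euclidean distance."""

    @staticmethod
    def key_distance(k1, k2):
        # squared Euclidean distance between two 2-D keys
        return (k1[0] - k2[0]) ** 2 + (k1[1] - k2[1]) ** 2

    def __missing__(self, key):
        # called only when 'key' is not stored exactly
        dmin = min(self.key_distance(key, k) for k in self)
        return [self[k] for k in self if self.key_distance(key, k) == dmin]
```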
Source: https://stackoverflow.com/questions/29094458/find-integer-nearest-neighbour-in-a-dict