Question
I would like to recognize and group duplicate values in a dictionary. To do this, I build a pseudo-hash (or rather, a signature) of my data set as follows:
    from collections import defaultdict
    from pickle import dumps

    taxonomy = {}               # signature -> value
    binder = defaultdict(list)  # signature -> keys sharing that value
    for key, value in ds.items():
        signature = dumps(value)
        taxonomy[signature] = value
        binder[signature].append(key)
For a concrete use-case see this question.
Unfortunately, I realized that even when the following comparison is True:
>>> ds['key1'] == ds['key2']
True
the following one is not always True:
>>> dumps(ds['key1']) == dumps(ds['key2'])
False
I noticed that the key order in the dumped output differs between the two dicts. If I copy and paste the output of ds['key1'] and ds['key2'] into new dictionaries, the comparison succeeds.
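The behaviour described above can be reproduced with a minimal sketch. It assumes Python 3.7 or later, where dicts preserve insertion order; the names d1 and d2 are illustrative and not part of the original data set.

```python
from pickle import dumps

# Two dicts with the same content, built in a different insertion order.
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}

# Equality ignores the order of the keys...
print(d1 == d2)                # True

# ...but pickle serializes items in iteration order, so the bytes differ.
print(dumps(d1) == dumps(d2))  # False
```

This is why a pickle-based signature cannot stand in for an equality test on dicts.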
As an overkill alternative I could traverse my dataset recursively and replace dict instances with OrderedDict:
    import collections
    import collections.abc
    import copy

    def faithfulrepr(od):
        od = copy.deepcopy(od)
        if isinstance(od, collections.abc.Mapping):
            res = collections.OrderedDict()
            for k, v in sorted(od.items()):
                res[k] = faithfulrepr(v)
            return repr(res)
        if isinstance(od, list):
            for i, v in enumerate(od):
                od[i] = faithfulrepr(v)
            return repr(od)
        return repr(od)
>>> faithfulrepr(ds['key1']) == faithfulrepr(ds['key2'])
True
I am worried about this naive approach because I do not know whether I cover all the possible situations.
What other (generic) alternative can I use?
Answer 1:
The first thing is to remove the call to deepcopy, which is your bottleneck here:
    import collections
    import collections.abc

    def faithfulrepr(ds):
        if isinstance(ds, collections.abc.Mapping):
            # Sort the items so that key order no longer affects the result.
            res = collections.OrderedDict(
                (k, faithfulrepr(v)) for k, v in sorted(ds.items())
            )
        elif isinstance(ds, list):
            res = [faithfulrepr(v) for v in ds]
        else:
            res = ds
        return repr(res)
However, sorted and repr have their drawbacks:
- you can't truly compare custom types, since their repr need not reflect their value;
- you can't use mappings whose keys have different types, since sorted cannot order them in Python 3.
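Both drawbacks can be illustrated with a short sketch. The Point class and the mixed dict are hypothetical examples and do not come from the question; the sketch assumes Python 3.

```python
class Point:
    """Hypothetical value type that defines __eq__ but keeps the default repr."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)


p1, p2 = Point(1, 2), Point(1, 2)
equal_by_eq = p1 == p2                # True: __eq__ compares the coordinates
equal_by_repr = repr(p1) == repr(p2)  # False: the default repr embeds the object's address

# A mapping whose keys have different types cannot be sorted in Python 3.
mixed = {1: "a", "b": 2}
try:
    sorted(mixed.items())
    sortable = True
except TypeError:
    # int and str keys are not orderable against each other.
    sortable = False

print(equal_by_eq, equal_by_repr, sortable)
```

A repr-based signature would therefore put p1 and p2 in different groups and would fail outright on the mixed dict.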
So the second thing is to get rid of faithfulrepr and compare objects with __eq__:
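    binder, values = [], []
    for key, value in ds.items():
        try:
            # list.index compares with ==, so no hashing or sorting is needed.
            index = values.index(value)
        except ValueError:
            # First time this value is seen: start a new group.
            values.append(value)
            binder.append([key])
        else:
            binder[index].append(key)
    # Map each tuple of keys to the value they share.
    grouped = dict(zip(map(tuple, binder), values))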
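As a usage sketch, the same loop can be wrapped in a function and run on a small data set. The function name group_by_equality and the sample data are assumptions made for illustration only.

```python
def group_by_equality(ds):
    """Group the keys of ds whose values compare equal with ==."""
    binder, values = [], []
    for key, value in ds.items():
        try:
            index = values.index(value)  # linear search using ==
        except ValueError:
            values.append(value)
            binder.append([key])
        else:
            binder[index].append(key)
    return dict(zip(map(tuple, binder), values))


sample = {
    "key1": {"a": 1, "b": [1, 2]},
    "key2": {"b": [1, 2], "a": 1},  # same content as key1, different key order
    "key3": {1: "x", "y": 2},       # mixed key types are not a problem here
    "key4": "text",
}

grouped = group_by_equality(sample)
for keys, value in grouped.items():
    print(keys, value)
```

Because every new value is compared against all distinct values seen so far, the cost grows roughly quadratically with the number of distinct values. Also note that == treats 1, 1.0 and True as equal, so such values end up in the same group.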
Source: https://stackoverflow.com/questions/40976060/find-duplicates-for-mixed-type-values-in-dictionaries