Find duplicates for mixed type values in dictionaries

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-24 08:02:48

Question


I would like to recognize and group duplicate values in a dictionary. To do this, I build a pseudo-hash (or rather, a signature) of my data set as follows:

from collections import defaultdict
from pickle import dumps

taxonomy = {}                # signature -> value
binder = defaultdict(list)   # signature -> keys sharing that value
for key, value in ds.items():
    signature = dumps(value)
    taxonomy[signature] = value
    binder[signature].append(key)

For a concrete use-case see this question.

Unfortunately, I realized that even when the following statement is True:

>>> ds['key1'] == ds['key2']
True

this one is not always True:

>>> dumps(ds['key1']) == dumps(ds['key2'])
False

I noticed that the key order in the dumped output differs between the two dicts. If I copy and paste the output of ds['key1'] and ds['key2'] into new dictionaries, the comparison succeeds.
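A minimal sketch of the mismatch, assuming Python 3.7 or later, where dicts keep insertion order. The names a and b are illustrative and not part of the original data set:

```python
from pickle import dumps

# Same key/value pairs, inserted in a different order.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

equal_as_dicts = (a == b)                  # dict equality ignores key order
equal_as_pickles = (dumps(a) == dumps(b))  # pickle writes items in iteration order

print(equal_as_dicts, equal_as_pickles)
```

The two dicts compare equal, yet their pickled bytes differ because the items are serialized in a different order.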

As an overkill alternative, I could traverse my dataset recursively and replace dict instances with OrderedDict:

import collections
import collections.abc
import copy

def faithfulrepr(od):
    od = copy.deepcopy(od)  # work on a copy so the caller's data is not mutated
    if isinstance(od, collections.abc.Mapping):
        res = collections.OrderedDict()
        for k, v in sorted(od.items()):
            res[k] = faithfulrepr(v)
        return repr(res)
    if isinstance(od, list):
        for i, v in enumerate(od):
            od[i] = faithfulrepr(v)
        return repr(od)
    return repr(od)

>>> faithfulrepr(ds['key1']) == faithfulrepr(ds['key2'])
True

I am worried about this naive approach because I do not know whether it covers all possible situations.

What other (generic) alternative can I use?


Answer 1:


The first thing to do is remove the call to deepcopy, which is the bottleneck here:

import collections
import collections.abc

def faithfulrepr(ds):
    if isinstance(ds, collections.abc.Mapping):
        # Sort the items so that equal mappings yield the same representation.
        res = collections.OrderedDict(
            (k, faithfulrepr(v)) for k, v in sorted(ds.items())
        )
    elif isinstance(ds, list):
        res = [faithfulrepr(v) for v in ds]
    else:
        res = ds
    return repr(res)

However, sorted and repr have their drawbacks:

  1. you can't truly compare custom types;
  2. you can't use mappings with keys of different types.
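Both drawbacks can be reproduced with a short sketch, assuming Python 3. The Point class and the mixed dict are hypothetical examples, not taken from the question:

```python
class Point:
    """Hypothetical custom type: value-based equality, default repr."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)


# Drawback 1: equal objects can have different reprs.
p, q = Point(1, 2), Point(1, 2)
same_value = (p == q)
same_repr = (repr(p) == repr(q))  # the default repr embeds each object's address

# Drawback 2: keys of different types cannot be ordered against each other.
mixed = {1: "a", "b": 2}
try:
    sorted(mixed.items())
    sortable = True
except TypeError:
    sortable = False

print(same_value, same_repr, sortable)
```

A signature built from repr would treat p and q as distinct even though they compare equal, and the sorted call fails before any signature is produced for the mixed-key mapping.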

So the second thing is to get rid of faithfulrepr altogether and compare objects with __eq__:

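binder, values = [], []   # binder[i] holds the keys whose value equals values[i]
for key, value in ds.items():
    try:
        index = values.index(value)   # list.index compares with ==
    except ValueError:
        # First time this value is seen: open a new group.
        values.append(value)
        binder.append([key])
    else:
        binder[index].append(key)
# Map each tuple of duplicate keys to the value they share.
grouped = dict(zip(map(tuple, binder), values))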
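As a usage sketch, the same loop can be wrapped in a function and run on a small data set. The function name group_duplicates and the sample data are assumptions made for illustration:

```python
def group_duplicates(ds):
    """Group the keys of ds whose values compare equal with ==."""
    binder, values = [], []
    for key, value in ds.items():
        try:
            index = values.index(value)
        except ValueError:
            values.append(value)
            binder.append([key])
        else:
            binder[index].append(key)
    return dict(zip(map(tuple, binder), values))


# Hypothetical data: key1 and key2 hold equal dicts built in a different order.
sample = {
    "key1": {"a": 1, "b": [1, 2]},
    "key2": {"b": [1, 2], "a": 1},
    "key3": "text",
    "key4": [1, 2],
}
grouped = group_duplicates(sample)
print(grouped)
```

Two caveats follow from relying on ==: each lookup scans the list of known values, so the cost grows quadratically with the number of distinct values, and values that Python considers equal across types (for example 1, 1.0 and True) end up in the same group.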


Source: https://stackoverflow.com/questions/40976060/find-duplicates-for-mixed-type-values-in-dictionaries
