Finding matching keys in two large dictionaries and doing it fast

问题

I am trying to find corresponding keys in two different dictionaries. Each has about 600k entries.

Say for example:

    myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
    myNames = { 'Actinobacter': '8924342' }

I want to print out the value for Actinobacter (8924342) since it matches a value in myRDP.

The following code works, but is very slow:

    for key in myRDP:
        for jey in myNames:
            if key == jey:
                print key, myNames[key]

I've tried the following but it always results in a KeyError:

    for key in myRDP:
        print myNames[key]

Is there perhaps a function implemented in C for doing this? I've googled around but nothing seems to work.

Thanks.

回答1:

Use sets, because they have a built-in intersection method which ought to be quick:

myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
myNames = { 'Actinobacter': '8924342' }

rdpSet = set(myRDP)
namesSet = set(myNames)

for name in rdpSet.intersection(namesSet):
    print name, myNames[name]

# Prints: Actinobacter 8924342

回答2:

You could do this:

for key in myRDP:
    if key in myNames:
        print key, myNames[key]

Your first attempt was slow because you were comparing every key in myRDP with every key in myNames. In algorithmic jargon, if myRDP has n elements and myNames has m elements, then that algorithm would take O(n×m) operations. For 600k elements each, this is 360,000,000,000 comparisons!

But testing whether a particular element is a key of a dictionary is fast -- in fact, this is one of the defining characteristics of dictionaries. In algorithmic terms, the key in dict test is O(1), or constant-time. So my algorithm will take O(n) time, which is one 600,000th of the time.

回答3:

in python 3 you can just do

myNames.keys() & myRDP.keys()

回答4:

for key in myRDP:
    name = myNames.get(key, None)
    if name:
        print key, name

dict.get returns the default value you give it (in this case, None) if the key doesn't exist.

回答5:

You could start by finding the common keys and then iterating over them. Set operations should be fast because they are implemented in C, at least in modern versions of Python.

common_keys = set(myRDP).intersection(myNames)
for key in common_keys:
    print key, myNames[key]

回答6:

Use the get method instead:

 for key in myRDP:
    value = myNames.get(key)
    if value != None:
      print key, "=", value

回答7:

Best and easiest way would be simply perform common set operations(Python 3).

a = {"a": 1, "b":2, "c":3, "d":4}
b = {"t1": 1, "b":2, "e":5, "c":3}
res = a.items() & b.items() # {('b', 2), ('c', 3)} For common Key and Value
res = {i[0]:i[1] for i in res}  # In dict format
common_keys = a.keys() & b.keys()  # {'b', 'c'}

Cheers!

回答8:

Copy both dictionaries into one dictionary/array. This makes sense as you have 1:1 related values. Then you need only one search, no comparison loop, and can access the related value directly.

Example Resulting Dictionary/Array:

[Name][Value1][Value2]

[Actinobacter][GATCGA...TCA][8924342]

[XYZbacter][BCABCA...ABC][43594344]

...

回答9:

Here is my code for doing intersections, unions, differences, and other set operations on dictionaries:

class DictDiffer(object):
    """
    Calculate the difference between two dictionaries as:
    (1) items added
    (2) items removed
    (3) keys same in both but changed values
    (4) keys same in both and unchanged values
    """
    def __init__(self, current_dict, past_dict):
        self.current_dict, self.past_dict = current_dict, past_dict
        self.set_current, self.set_past = set(current_dict.keys()), set(past_dict.keys())
        self.intersect = self.set_current.intersection(self.set_past)
    def added(self):
        return self.set_current - self.intersect 
    def removed(self):
        return self.set_past - self.intersect 
    def changed(self):
        return set(o for o in self.intersect if self.past_dict[o] != self.current_dict[o])
    def unchanged(self):
        return set(o for o in self.intersect if self.past_dict[o] == self.current_dict[o])

if __name__ == '__main__':
    import unittest
    class TestDictDifferNoChanged(unittest.TestCase):
        def setUp(self):
            self.past = dict((k, 2*k) for k in range(5))
            self.current = dict((k, 2*k) for k in range(3,8))
            self.d = DictDiffer(self.current, self.past)
        def testAdded(self):
            self.assertEqual(self.d.added(), set((5,6,7)))
        def testRemoved(self):      
            self.assertEqual(self.d.removed(), set((0,1,2)))
        def testChanged(self):
            self.assertEqual(self.d.changed(), set())
        def testUnchanged(self):
            self.assertEqual(self.d.unchanged(), set((3,4)))
    class TestDictDifferNoCUnchanged(unittest.TestCase):
        def setUp(self):
            self.past = dict((k, 2*k) for k in range(5))
            self.current = dict((k, 2*k+1) for k in range(3,8))
            self.d = DictDiffer(self.current, self.past)
        def testAdded(self):
            self.assertEqual(self.d.added(), set((5,6,7)))
        def testRemoved(self):      
            self.assertEqual(self.d.removed(), set((0,1,2)))
        def testChanged(self):
            self.assertEqual(self.d.changed(), set((3,4)))
        def testUnchanged(self):
            self.assertEqual(self.d.unchanged(), set())
    unittest.main()

来源：https://stackoverflow.com/questions/1317410/finding-matching-keys-in-two-large-dictionaries-and-doing-it-fast

标签

python

bioinformatics