Question

I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short documents, with ID numbers as keys and their text content as values.

To perform an 'AND' search for two terms, I thus need to intersect their postings lists (dictionaries). What is a clear (not necessarily overly clever) way to do this in Python? I started out by trying it the long way with iter:

p1 = index[term1]
p2 = index[term2]
i1 = iter(p1)
i2 = iter(p2)
while ...  # not sure of the 'iter != end' syntax in this case
...

Answer 1:

You can easily calculate the intersection of sets, so create sets from the keys and use them for the intersection:

keys_a = set(dict_a.keys())
keys_b = set(dict_b.keys())
intersection = keys_a & keys_b # '&' operator is used for set intersection

Answer 2:

A little-known fact is that you don't need to construct sets to do this:

In Python 2:

In [78]: d1 = {'a': 1, 'b': 2}

In [79]: d2 = {'b': 2, 'c': 3}

In [80]: d1.viewkeys() & d2.viewkeys()
Out[80]: {'b'}

In Python 3 replace viewkeys with keys; the same applies to viewvalues and viewitems.

From the documentation of viewitems:

In [113]: d1.viewitems??
Type:       builtin_function_or_method
String Form:<built-in method viewitems of dict object at 0x64a61b0>
Docstring:  D.viewitems() -> a set-like object providing a view on D's items

For larger dicts this is also slightly faster than constructing sets and then intersecting them:

In [122]: d1 = {i: rand() for i in range(10000)}

In [123]: d2 = {i: rand() for i in range(10000)}

In [124]: timeit d1.viewkeys() & d2.viewkeys()
1000 loops, best of 3: 714 µs per loop

In [125]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2

1000 loops, best of 3: 805 µs per loop

For smaller dicts, constructing sets is faster:

In [126]: d1 = {'a': 1, 'b': 2}

In [127]: d2 = {'b': 2, 'c': 3}

In [128]: timeit d1.viewkeys() & d2.viewkeys()
1000000 loops, best of 3: 591 ns per loop

In [129]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2

1000000 loops, best of 3: 477 ns per loop

We're comparing nanoseconds here, which may or may not matter to you. In any case, you get back a set, so using viewkeys/keys eliminates a bit of clutter.
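
Applied to the postings lists from the question, the Python 3 form of this idea might look like the sketch below. The toy index contents and the helper name and_search are made up for illustration; only keys() and the & operator come from the answer above.

```python
# Hypothetical toy index: term -> {doc_id: text}
index = {
    "python": {1: "python dict tips", 2: "python set tricks"},
    "dict": {1: "python dict tips", 3: "dict views explained"},
}

def and_search(index, term1, term2):
    """Return {doc_id: text} for documents listed under both terms."""
    # A term that is missing from the index gets an empty postings list
    p1 = index.get(term1, {})
    p2 = index.get(term2, {})
    # dict.keys() is set-like in Python 3, so & yields a plain set of IDs
    common_ids = p1.keys() & p2.keys()
    return {doc_id: p1[doc_id] for doc_id in common_ids}

result = and_search(index, "python", "dict")
print(result)  # {1: 'python dict tips'}
```

The documents are read back from p1 here; in this index layout both postings lists hold the same text for a given ID, so either one would do.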

Answer 3:

In [1]: d1 = {'a':1, 'b':4, 'f':3}

In [2]: d2 = {'a':1, 'b':4, 'd':2}

In [3]: d = {x:d1[x] for x in d1 if x in d2}

In [4]: d
Out[4]: {'a': 1, 'b': 4}

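Answer 4:

In Python 3, you can use set operations on the items views. Items are compared as (key, value) pairs, so the values generally need to be hashable:

intersection = dict(dict1.items() & dict2.items())
union = dict(dict1.items() | dict2.items())  # if a key has different values in each dict, which one survives is not defined
symmetric_difference = dict(dict1.items() ^ dict2.items())  # ^ is the symmetric difference; use - for a one-sided difference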
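
One thing to keep in mind with the items-based form is that a pair only counts as common when both the key and the value match. The small sketch below, with made-up dictionaries, contrasts that with a keys-only intersection:

```python
dict1 = {"a": 1, "b": 2, "c": 3}
dict2 = {"b": 2, "c": 30, "d": 4}

# Items must match on both key and value, so 'c' is excluded
same_items = dict(dict1.items() & dict2.items())

# Keys alone: 'b' and 'c' are both common
common_keys = dict1.keys() & dict2.keys()

# Keep dict1's values for the common keys
by_key = {k: dict1[k] for k in common_keys}

print(same_items)           # {'b': 2}
print(sorted(common_keys))  # ['b', 'c']
```

For the inverted index in the question, where an ID always maps to the same text, both forms should give the same documents; the keys-only form avoids hashing the document text.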

Answer 5:

Just wrap the dictionary instances in a simple class that returns the values from both dictionaries for a given key:

class DictionaryIntersection(object):
    def __init__(self,dictA,dictB):
        self.dictA = dictA
        self.dictB = dictB

    def __getitem__(self,attr):
        if attr not in self.dictA or attr not in self.dictB:
            raise KeyError('Not in both dictionaries,key: %s' % attr)

        return self.dictA[attr],self.dictB[attr]

x = {'foo' : 5, 'bar' :6}
y = {'bar' : 'meow' , 'qux' : 8}

z = DictionaryIntersection(x,y)

print(z['bar'])  # (6, 'meow')

Answer 6:

Okay, here is a generalized version of the code above for Python 3. It uses comprehensions and set-like dict views, which are fast enough.

The function intersects arbitrarily many dicts and returns a dict with the common keys and, for each common key, the set of values that key has across the dicts:

def dict_intersect(*dicts):
    comm_keys = dicts[0].keys()
    for d in dicts[1:]:
        # intersect keys first
        comm_keys &= d.keys()
    # then build a result dict with nested comprehension
    result = {key:{d[key] for d in dicts} for key in comm_keys}
    return result

Usage example:

a = {1: 'ba', 2: 'boon', 3: 'spam', 4:'eggs'}
b = {1: 'ham', 2:'baboon', 3: 'sausages'}
c = {1: 'more eggs', 3: 'cabbage'}

res = dict_intersect(a, b, c)
# Here is res (the order of values may vary) :
# {1: {'ham', 'more eggs', 'ba'}, 3: {'spam', 'sausages', 'cabbage'}}

Here the dict values must be hashable. If they aren't, you could simply change the set braces { } to list brackets [ ]:

result = {key:[d[key] for d in dicts] for key in comm_keys}

Answer 7:

Your question isn't precise enough to give a single answer.

1. Key Intersection

If you want to intersect the document IDs of the two postings lists (credit to James), do:

common_ids = p1.keys() & p2.keys()

However, if you want to iterate over the documents, you have to decide which postings list takes priority; I assume it's p1. To iterate over the documents for common_ids, collections.ChainMap is useful:

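from collections import ChainMap
# ChainMap looks keys up in p1 first, then p2; iterate .items() to get (id, document) pairs
intersection = {id: document
                for id, document in ChainMap(p1, p2).items()
                if id in common_ids}
for id, document in intersection.items():
    ...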

Or, if you don't want to create a separate intersection dictionary:

from collections import ChainMap
posts = ChainMap(p1, p2)
for id in common_ids:
    document = posts[id]

2. Items Intersection

If you want to intersect the items of both postings lists, which means matching both IDs and documents, use the code below (credit to DCPY). However, this is only useful if you're looking for duplicates across terms.

duplicates = dict(p1.items() & p2.items())
for id, document in duplicates.items():
    ...

3. Iterate over `p1` 'AND' `p2`.

If by "'AND' search" and using iter you meant searching both postings lists, then collections.ChainMap is again the best way to iterate over (almost) all items in multiple postings lists:

from collections import ChainMap
for id, document in ChainMap(p1, p2).items():
    ...
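
The "(almost)" matters because a ChainMap yields each key only once and takes the value from the first mapping that contains it. Note also that iterating a ChainMap directly yields keys, so items() is needed to get pairs. A small sketch with made-up postings lists p1 and p2:

```python
from collections import ChainMap

# Hypothetical postings lists; ID 1 appears in both with different text
p1 = {1: "from p1", 2: "only in p1"}
p2 = {1: "from p2", 3: "only in p2"}

merged = ChainMap(p1, p2)

# items() yields (key, value) pairs; each key appears once
for doc_id, document in sorted(merged.items()):
    print(doc_id, document)

# ID 1 exists in both mappings; the first mapping (p1) wins
print(merged[1])  # from p1
```

This gives the union of the two postings lists rather than their intersection, so it corresponds to an 'OR' search unless the result is filtered by the common IDs.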

Answer 8:

def two_keys(term_a, term_b, index):
    doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
    doc_store = index[term_a] # index[term_b] would work also
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

def n_keys(terms, index):
    doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
    doc_store = index[terms[0]]  # any of the terms would work; use the first
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

In [0]: index = {'a': {1: 'a b'}, 
                 'b': {1: 'a b'}}

In [1]: two_keys('a','b', index)
Out[1]: {1: 'a b'}

In [2]: n_keys(['a','b'], index)
Out[2]: {1: 'a b'}

I would recommend changing your index from

index = {term: {doc_id: doc}}

to two indexes, one mapping terms to document IDs and a separate one holding the documents:

term_index = {term: set([doc_id])}
doc_store = {doc_id: doc}

That way you don't store multiple copies of the same data.
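
With that layout, an 'AND' search over any number of terms reduces to a set intersection followed by a lookup in doc_store. A minimal sketch, assuming the two structures above; the sample data and the function name search_all_terms are hypothetical:

```python
# Hypothetical data: each document is stored once, terms map to ID sets
doc_store = {
    1: "a b",
    2: "a c",
    3: "b c",
}
term_index = {
    "a": {1, 2},
    "b": {1, 3},
    "c": {2, 3},
}

def search_all_terms(terms, term_index, doc_store):
    """Return {doc_id: doc} for documents that contain every term."""
    if not terms:
        return {}
    # Unknown terms get an empty postings set, which empties the result
    postings = [term_index.get(term, set()) for term in terms]
    doc_ids = set.intersection(*postings)
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

print(search_all_terms(["a", "b"], term_index, doc_store))  # {1: 'a b'}
```

Returning an empty dict for an empty term list is a design choice made for this sketch; raising an error would be equally reasonable.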

Source: https://stackoverflow.com/questions/18554012/intersecting-two-dictionaries-in-python

Tags: python, dictionary, iteration, intersection
