Intersecting two dictionaries in Python


I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short document


生来不讨喜

2020-11-27 18:50

def two_keys(term_a, term_b, index):
    doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
    doc_store = index[term_a] # index[term_b] would work also
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

def n_keys(terms, index):
    doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
    doc_store = index[terms[0]]  # any of the terms would work; they all hold the shared docs
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

In [0]: index = {'a': {1: 'a b'}, 
                 'b': {1: 'a b'}}

In [1]: two_keys('a','b', index)
Out[1]: {1: 'a b'}

In [2]: n_keys(['a','b'], index)
Out[2]: {1: 'a b'}

I would recommend changing your index from

index = {term: {doc_id: doc}}

to two indexes: one that maps each term to its document IDs, and a separate one that holds the documents themselves

term_index = {term: set([doc_id])}
doc_store = {doc_id: doc}

That way you don't store multiple copies of the same document.
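
Here is a minimal sketch of how an 'AND' search could look with that split layout. The helper names `build_indexes` and `and_search` and the sample documents are made up for illustration, and splitting on whitespace is only an assumption about how terms are extracted.

```python
def build_indexes(docs):
    """Split {doc_id: text} into a term index and a document store."""
    term_index = {}
    doc_store = dict(docs)
    for doc_id, text in docs.items():
        # Assumption: terms are whitespace-separated words.
        for term in text.split():
            term_index.setdefault(term, set()).add(doc_id)
    return term_index, doc_store


def and_search(terms, term_index, doc_store):
    """Return the documents that contain every term in `terms`."""
    if not terms:
        return {}
    # A term missing from the index contributes an empty set, so nothing matches.
    id_sets = [term_index.get(term, set()) for term in terms]
    doc_ids = set.intersection(*id_sets)
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}


# Hypothetical sample data
docs = {1: 'a b', 2: 'a c', 3: 'b c'}
term_index, doc_store = build_indexes(docs)
result = and_search(['a', 'b'], term_index, doc_store)
print(result)  # {1: 'a b'}
```

Each document is stored once in `doc_store`, and the term index only carries small sets of IDs.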


予麋鹿

2020-11-27 18:58
Your question isn't precise enough to give a single answer.

1. Key Intersection

If you want to intersect the IDs of two posts (credit to James), do:
```
common_ids = p1.keys() & p2.keys()
```
However, if you want to iterate over the documents, you have to decide which post takes priority; I assume it's p1. To iterate over the documents for common_ids, collections.ChainMap is the most useful tool:
```
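from collections import ChainMap

# ChainMap looks keys up in p1 first, so p1's document wins for shared IDs.
# Iterating a mapping yields keys only, so .items() is needed for (id, document) pairs.
intersection = {id: document
                for id, document in ChainMap(p1, p2).items()
                if id in common_ids}
for id, document in intersection.items():
    ...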
```
Or, if you don't want to create a separate intersection dictionary:
```
from collections import ChainMap
posts = ChainMap(p1, p2)
for id in common_ids:
    document = posts[id]
```
2. Items Intersection

If you want to intersect the items of both posts, which means matching both IDs and documents, use the code below (credit to DCPY). However, this is only useful if you're looking for duplicates across terms.
```
duplicates = dict(p1.items() & p2.items())
for id, document in duplicates.items():
    ...
```
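To make the difference between the two kinds of intersection concrete, here is a small sketch with made-up posts. The contents of `p1` and `p2` are only an assumption for illustration. Note that intersecting `items()` views requires the documents to be hashable, which holds for strings.

```python
# Hypothetical postings: ID 2 appears in both, but with different documents.
p1 = {1: 'a b', 2: 'a c'}
p2 = {1: 'a b', 2: 'a d', 3: 'b c'}

# Key intersection: IDs present in both posts, whatever the documents are.
common_ids = p1.keys() & p2.keys()

# Items intersection: only pairs where both the ID and the document match.
duplicates = dict(p1.items() & p2.items())

print(sorted(common_ids))  # [1, 2]
print(duplicates)          # {1: 'a b'}
```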
3. Iterate over p1 'AND' p2.

If by "'AND' search" and using iter you meant searching both posts, then collections.ChainMap is again the best way to iterate over (almost) all items in multiple posts. It is "almost" all because, for an ID present in both posts, only p1's document is yielded:
```
from collections import ChainMap
for id, document in ChainMap(p1, p2).items():
    ...
```
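As a rough illustration of the priority rule, assuming two small made-up posts that share one ID, the following shows which document `ChainMap` returns for the shared ID. The sample values are hypothetical.

```python
from collections import ChainMap

# Hypothetical postings: ID 1 exists in both posts with different documents.
p1 = {1: 'from p1', 2: 'only p1'}
p2 = {1: 'from p2', 3: 'only p2'}

posts = ChainMap(p1, p2)

# Lookups search p1 first, then p2, so p1's entry shadows p2's for ID 1.
merged = dict(posts.items())

print(posts[1])        # from p1
print(sorted(merged))  # [1, 2, 3]
```

Swapping the arguments, as in `ChainMap(p2, p1)`, would give p2 priority instead.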