I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short document
def two_keys(term_a, term_b, index):
    # Document IDs that appear under both terms
    doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
    doc_store = index[term_a]  # index[term_b] would work also
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

def n_keys(terms, index):
    # Document IDs that appear under every term in the list
    doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
    doc_store = index[terms[0]]  # any of the terms would work here
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}
In [0]: index = {'a': {1: 'a b'},
                 'b': {1: 'a b'}}
In [1]: two_keys('a','b', index)
Out[1]: {1: 'a b'}
In [2]: n_keys(['a','b'], index)
Out[2]: {1: 'a b'}
I would recommend changing your index from
index = {term: {doc_id: doc}}
to two indexes: one mapping each term to its set of document IDs, and a separate one holding the documents themselves:
term_index = {term: set([doc_id])}
doc_store = {doc_id: doc}
That way you don't store multiple copies of the same data.
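A minimal sketch of how a search over the split layout might look. The function name search_all and the sample data are made up for illustration, and treating an unknown term as matching nothing is an assumption rather than something the question specifies:

```python
# Hypothetical sample data in the two-index layout described above.
term_index = {
    "a": {1, 2},
    "b": {1, 3},
}
doc_store = {1: "a b", 2: "a c", 3: "b c"}


def search_all(terms, term_index, doc_store):
    """Return {doc_id: doc} for documents that contain every term."""
    if not terms:
        return {}
    # Assumption: a term missing from the index matches no documents.
    id_sets = [term_index.get(term, set()) for term in terms]
    doc_ids = set.intersection(*id_sets)
    # Each document is stored once and looked up by ID.
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}


result = search_all(["a", "b"], term_index, doc_store)
print(result)
```

Because the postings are already sets, no conversion is needed before intersecting them.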
Your question isn't precise enough to give a single answer.
If you want to intersect the IDs from both posts (credit to James), do:
common_ids = p1.keys() & p2.keys()
However, if you want to iterate over the documents, you have to decide which post takes priority; I assume it's p1. To iterate over the documents for common_ids, collections.ChainMap is the most useful tool:
from collections import ChainMap
# Iterating a ChainMap directly yields only keys, so use .items()
intersection = {id: document
                for id, document in ChainMap(p1, p2).items()
                if id in common_ids}
for id, document in intersection.items():
    ...
Or, if you don't want to create a separate intersection dictionary:
from collections import ChainMap
posts = ChainMap(p1, p2)
for id in common_ids:
    document = posts[id]
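To make the priority rule concrete, here is a small sketch with made-up postings in which document ID 2 carries different text in each post. The contents of p1 and p2 are assumptions chosen only to show which mapping wins:

```python
from collections import ChainMap

# Hypothetical postings: ID 2 appears in both, with different text.
p1 = {1: "a b", 2: "a c (from p1)"}
p2 = {2: "a c (from p2)", 3: "b c"}

common_ids = p1.keys() & p2.keys()

# ChainMap searches its mappings left to right, so p1 takes priority.
posts = ChainMap(p1, p2)
common_docs = {doc_id: posts[doc_id] for doc_id in common_ids}
print(common_docs)
```

Swapping the arguments to ChainMap(p2, p1) would give p2 priority instead.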
If you want to intersect the items of both posts, that is, to match both IDs and documents, use the code below (credit to DCPY). However, this is only useful if you're looking for duplicates across terms.
duplicates = dict(p1.items() & p2.items())
for id, document in duplicates.items():
    ...
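As a rough illustration of the difference between intersecting keys and intersecting items, the sketch below uses made-up postings where one ID is shared but its document differs. Note that intersecting items only works when the documents are hashable (strings here):

```python
# Hypothetical postings: ID 1 is shared but the text differs,
# ID 2 is shared with identical text.
p1 = {1: "a b", 2: "a c"}
p2 = {1: "different text", 2: "a c", 3: "b c"}

shared_ids = p1.keys() & p2.keys()           # matches on ID only
duplicates = dict(p1.items() & p2.items())   # matches on ID and document

print(sorted(shared_ids))
print(duplicates)
```

Only the pair whose ID and document both agree survives the items intersection.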
If by "'AND' search" and using iter you meant searching both posts, then collections.ChainMap is again the best way to iterate over (almost) all items in multiple posts:
from collections import ChainMap
for id, document in ChainMap(p1, p2).items():
    ...
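The "(almost)" matters when the same ID occurs in more than one post: a ChainMap yields each key once, with the value from the first mapping that contains it. A short sketch with made-up postings:

```python
from collections import ChainMap

# Hypothetical postings: ID 2 appears in both.
p1 = {1: "from p1", 2: "from p1"}
p2 = {2: "from p2", 3: "from p2"}

# .items() yields (id, document) pairs; iterating the ChainMap
# itself would yield only the IDs.
merged = dict(ChainMap(p1, p2).items())

total_entries = len(p1) + len(p2)
print(len(merged), "of", total_entries, "entries visited")
```

Here the p2 entry for ID 2 is never visited, because p1 shadows it.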