Is there a better way to find set intersection for Search engine code?

问题

I have been coding up a small search engine and need to find out if there is a faster way to find set intersections. Currently, I am using a Sorted linked list as explained in most search engine algorithms. i.e for every word I have a list of documents sorted in a list and then find the intersection among the lists.

The performance profiling of the case is here. Any other ideas for a faster set intersection?

回答1:

An efficient way to do it is by "zig-zag":

Assume your terms is a list T:

lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //the first term in T
while (lastDoc != infinity):
  if (currTerm > T.last): //if we have passed the last term:
     insert lastDoc into result
     currTerm <- 0
     lastDoc <- lastDoc + 1
     continue
  docId <- T[currTerm].getFirstAfter(lastDoc-1)
  if (docID != lastDoc):
     lastDoc <- docID
     currTerm <- 0
  else: 
     currTerm <- currTerm + 1

This algorithm assumes efficient getFirstAfter() which can give you the first document which fits the term and his docId is greater then the specified parameter. It should return infinity if there is none.

The algorithm will be most efficient if the terms are sorted such that the rarest term is first.

The algorithm ensures at most #docs_matching_first_term * #terms iterations, but practically - it will usually be much less iterations.

More info can be found in this lecture notes slides 11-13 [copy rights in the lecture's first page]

回答2:

Here's a research paper that has a quantitave analysis for comparing current algorithms.

来源：https://stackoverflow.com/questions/9209693/is-there-a-better-way-to-find-set-intersection-for-search-engine-code

标签

algorithm

set

search-engine

intersection

information-retrieval