Is there a better way to find set intersection for Search engine code?

Submitted by 怎甘沉沦 on 2019-12-03 17:34:01

An efficient way to do it is by "zig-zag":

Assume your terms are in a list T:

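lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //index of the first term in T
while (lastDoc != infinity):
  if (currTerm > T.lastIndex): //we have passed the last term, so every term matched lastDoc
     insert lastDoc into result
     currTerm <- 0
     lastDoc <- lastDoc + 1
     continue
  //first doc that contains T[currTerm] and has docId >= lastDoc
  docId <- T[currTerm].getFirstAfter(lastDoc-1)
  if (docId != lastDoc): //mismatch: jump ahead and restart from the first term
     lastDoc <- docId
     currTerm <- 0
  else: //match: check the next term
     currTerm <- currTerm + 1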

This algorithm assumes an efficient getFirstAfter(), which returns the first document that matches the term and whose docId is greater than the specified parameter. It should return infinity if there is no such document.
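
A minimal Python sketch of the loop above may make it easier to follow. It assumes each term's posting list is a sorted Python list of non-negative integer doc ids, and it implements getFirstAfter() with a binary search from the standard bisect module. The names get_first_after and zigzag_intersect are only illustrative; a real index would more likely use skip lists or a similar structure.

```python
from bisect import bisect_right
import math

INF = math.inf  # stands in for "infinity" in the pseudocode


def get_first_after(postings, doc_id):
    """Return the smallest id in the sorted list `postings` that is > doc_id, or INF."""
    i = bisect_right(postings, doc_id)
    return postings[i] if i < len(postings) else INF


def zigzag_intersect(posting_lists):
    """Intersect sorted posting lists (assumed layout: one sorted list per term)."""
    if not posting_lists:
        return []
    result = []
    last_doc = 0   # current candidate doc id
    curr_term = 0  # index of the term being checked against last_doc
    while last_doc != INF:
        if curr_term == len(posting_lists):
            # Every term matched last_doc: record it and move on.
            result.append(last_doc)
            curr_term = 0
            last_doc += 1
            continue
        doc_id = get_first_after(posting_lists[curr_term], last_doc - 1)
        if doc_id != last_doc:
            # Mismatch: jump to the new candidate and restart from the first term.
            last_doc = doc_id
            curr_term = 0
        else:
            # Match: check the next term.
            curr_term += 1
    return result


if __name__ == "__main__":
    lists = [[2, 5, 9], [1, 2, 3, 5, 8, 9, 10], [2, 4, 5, 6, 12]]
    print(zigzag_intersect(lists))  # prints [2, 5]
```

The binary search keeps each getFirstAfter() call at O(log n) for a list of n ids, which is what lets the loop skip over long runs of non-matching documents.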

The algorithm will be most efficient if the terms are sorted such that the rarest term is first.
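
One simple way to get that ordering, sketched in Python, is to sort the query terms by the length of their posting lists. The dictionary-based index layout and the name order_rarest_first are assumptions made for this example, not part of the algorithm above.

```python
def order_rarest_first(index, terms):
    """Sort query terms by ascending posting-list length (rarest term first).

    `index` is assumed to map each term to a sorted list of doc ids;
    a term missing from the index is treated as having an empty list.
    """
    return sorted(terms, key=lambda term: len(index.get(term, [])))


# Hypothetical toy index for illustration only.
index = {
    "the": [1, 2, 3, 4, 5, 6, 7, 8],
    "search": [2, 5, 8],
    "zigzag": [5],
}
query = ["the", "search", "zigzag"]
ordered = order_rarest_first(index, query)
posting_lists = [index.get(term, []) for term in ordered]
print(ordered)  # prints ['zigzag', 'search', 'the']
```

A term that does not occur in the index sorts first with an empty list, so the intersection can stop almost immediately.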

The algorithm needs on the order of #docs_matching_first_term * #terms iterations in the worst case, but in practice it usually takes far fewer.
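
A rough way to see how the term order affects the number of iterations is to count loop passes on a small example. The Python sketch below assumes sorted lists of non-negative integer doc ids and a binary-search getFirstAfter(); count_passes and the sample lists are made up for illustration.

```python
from bisect import bisect_right
import math

INF = math.inf


def get_first_after(postings, doc_id):
    """Smallest id in the sorted list `postings` that is > doc_id, or INF."""
    i = bisect_right(postings, doc_id)
    return postings[i] if i < len(postings) else INF


def count_passes(posting_lists):
    """Run the zig-zag loop and return (result, number of loop passes)."""
    result, passes = [], 0
    last_doc, curr_term = 0, 0
    while posting_lists and last_doc != INF:
        passes += 1
        if curr_term == len(posting_lists):
            result.append(last_doc)
            curr_term, last_doc = 0, last_doc + 1
            continue
        doc_id = get_first_after(posting_lists[curr_term], last_doc - 1)
        if doc_id != last_doc:
            last_doc, curr_term = doc_id, 0
        else:
            curr_term += 1
    return result, passes


evens = list(range(0, 100, 2))  # a common term
odds = list(range(1, 100, 2))   # another common term
rare = [50]                     # a rare term

_, rare_first = count_passes([rare, evens, odds])
_, rare_last = count_passes([evens, odds, rare])
print(rare_first, rare_last)  # the rarest-first order needs far fewer passes here
```

With the rare term first, the candidate jumps straight to its few documents. With the rare term last, the loop bounces between the two common lists for the whole collection.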

More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].

Here's a research paper with a quantitative analysis comparing current algorithms.
