Efficient low-cardinality ANDs in a search engine

∥☆過路亽.° 提交于 2021-01-29 05:03:02

问题


How do search engines such as Lucene, etc. perform AND queries where a term is common to many documents in the dataset? For example, in an inverted index of:

term    | document_id
---------------------
program | 1, 2, 3, 5...
python  | 1, 4
code    | 4
c++     | 4, 5

the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

Is there a way to perform AND queries without having to take the intersection of terms contained by potentially billions of documents?


回答1:


the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

Yes. Say you have the following query:

term1 AND term2 AND term3

You first need to compute the document frequency of each positive term. You pick the word with the lowest count.

You retrieve the documents that contains the least common term from the query. Those are the candidates. Then you filter and score those candidates with the query with a finite state machine.

So the database has several subspace:

  1. A mapping from lemma or stem or term to document frequency (as in tfidf)
  2. The actual inverted-index that allows to retrieve the documents that contains a given lemma
  3. A mapping between the document id and the full text representation of the document or just bag of words depending on how advanced is your query logic.

Then the filter + score step can happen in parallel



来源:https://stackoverflow.com/questions/64232411/efficient-low-cardinality-ands-in-a-search-engine

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!