Efficient low-cardinality ANDs in a search engine

问题

How do search engines such as Lucene, etc. perform AND queries where a term is common to many documents in the dataset? For example, in an inverted index of:

term    | document_id
---------------------
program | 1, 2, 3, 5...
python  | 1, 4
code    | 4
c++     | 4, 5

the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

Is there a way to perform AND queries without having to take the intersection of terms contained by potentially billions of documents?

回答1:

the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

Yes. Say you have the following query:

term1 AND term2 AND term3

You first need to compute the document frequency of each positive term. You pick the word with the lowest count.

You retrieve the documents that contains the least common term from the query. Those are the candidates. Then you filter and score those candidates with the query with a finite state machine.

So the database has several subspace:

A mapping from lemma or stem or term to document frequency (as in tfidf)
The actual inverted-index that allows to retrieve the documents that contains a given lemma
A mapping between the document id and the full text representation of the document or just bag of words depending on how advanced is your query logic.

Then the filter + score step can happen in parallel

来源：https://stackoverflow.com/questions/64232411/efficient-low-cardinality-ands-in-a-search-engine

标签

indexing

lucene

search-engine

inverted-index