Complexity of a Lucene search

Submitted by 偶尔善良 on 2019-12-03 05:57:34

Lucene basically uses a Vector Space Model (VSM) with a tf-idf scheme. So, in the standard setting we have:

  • A collection of documents each represented as a vector
  • A text query also represented as a vector

We determine the K documents of the collection with the highest vector space scores on the query q. Typically, we seek these K top documents ordered by score in decreasing order; for instance many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.

The basic algorithm for computing vector space scores is:

float Scores[N] = 0
Initialize Length[N]
for each query term t
do calculate w(t,q) and fetch postings list for t (stored in the index)
    for each pair (d, tf(t,d)) in postings list
    do Scores[d] += wf(t,d) × w(t,q)  (dot product)
Read the array Length[d]
for each d
do Scores[d] = Scores[d] / Length[d]
return Top K components of Scores[]

Where

  • The array Length holds the lengths (normalization factors) for each of the N documents, whereas the array Scores holds the scores for each of the documents.
  • tf is the term frequency of a term in a document.
  • w(t,q) is the weight of term t in the query. Note that the query is treated as a bag of words, so it gets its own vector of weights, as if it were another document.
  • wf(t,d) is the logarithmically scaled term weight, applied to both query and document term frequencies.
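The pseudocode above can be sketched as a runnable term-at-a-time scorer in Python. This is a minimal in-memory toy, not Lucene's actual API; the names `postings`, `length`, and `cosine_score` are illustrative, and the weighting (log-scaled tf, tf-idf for the query, length normalization) follows the definitions given above:

```python
import math
from collections import Counter

def cosine_score(docs, query, k):
    """Rank docs (lists of tokens) against a query (list of tokens).

    Mirrors the pseudocode: accumulate wf(t,d) * w(t,q) per document,
    then normalize by document vector length and return the top K.
    """
    n = len(docs)
    # Build postings lists: term -> list of (doc_id, term_frequency)
    postings = {}
    for d, tokens in enumerate(docs):
        for term, tf in Counter(tokens).items():
            postings.setdefault(term, []).append((d, tf))

    # Length[N]: Euclidean norms of the log-weighted document vectors
    length = [0.0] * n
    for plist in postings.values():
        for d, tf in plist:
            length[d] += (1 + math.log(tf)) ** 2
    length = [math.sqrt(x) for x in length]

    scores = [0.0] * n  # Scores[N] = 0
    for term, qtf in Counter(query).items():
        if term not in postings:
            continue  # term absent from the index contributes nothing
        df = len(postings[term])
        w_tq = (1 + math.log(qtf)) * math.log(n / df)  # tf-idf weight w(t,q)
        for d, tf in postings[term]:
            scores[d] += (1 + math.log(tf)) * w_tq     # wf(t,d) * w(t,q)

    # Scores[d] / Length[d], then Top K components ordered by decreasing score
    scores = [s / l if l else 0.0 for s, l in zip(scores, length)]
    return sorted(range(n), key=lambda d: scores[d], reverse=True)[:k]
```

Note that only documents appearing in some query term's postings list ever receive a nonzero score, which is what makes the term-at-a-time loop cheaper than a naive scan over all of D.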

As described in Complexity of vector dot-product, a vector dot product is O(n). Here the dimension is the number of terms in our vocabulary, |T|, where T is the set of terms.
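As a concrete check that the per-document cost is linear in the vector dimension, a dot product over |T|-dimensional weight vectors does exactly one multiply-add per term:

```python
def dot(u, v):
    """Dot product of two equal-length weight vectors: O(n) multiply-adds."""
    assert len(u) == len(v)
    return sum(ui * vi for ui, vi in zip(u, v))
```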

So, the time complexity of this algorithm is:

O(|Q| · |D| · |T|) = O(|D| · |T|)

where we consider |Q| fixed, since Q is the set of words in the query and queries are short (on average between 2 and 3 terms), and D is the set of all documents.

However, for a given search these sets are bounded, and indexes don't tend to grow very often. So, as a result, searches using VSM are really fast (when T and D are very large, the search becomes slow and one has to find an alternative approach).

"D is the set of all documents"

Before (or, honestly, alongside) the VSM scoring, Boolean retrieval is invoked. Thus, we can say d counts only the matching documents (almost; OK, in the best case). Since Scores is a priority queue (at least in the document-at-a-time scheme) built on a heap, inserting each d takes log(K). Therefore we can estimate the complexity as O(d · log(K)); I am omitting T here since the query is expected to be short. (Otherwise, you are in trouble.)
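The log(K) insertion cost comes from keeping only the K best hits in a bounded min-heap while streaming over the matching documents. A minimal sketch using Python's heapq (illustrative; Lucene's own hit queue is a Java implementation of the same idea):

```python
import heapq

def top_k(scored_docs, k):
    """Keep the K best (score, doc_id) pairs from a stream of matches.

    The heap holds at most K entries, with the worst surviving score at
    the root. Each push/replace costs O(log K), so scanning d matching
    documents costs O(d * log K) overall.
    """
    heap = []  # min-heap: root is the worst of the current top K
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))  # evict current worst
    return sorted(heap, reverse=True)  # best first
```

Because the heap never grows past K entries, memory stays O(K) regardless of how many documents match, which is why this is the standard shape for document-at-a-time ranking.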

http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf
