Different Lucene search results with different search space sizes

Question

I have an application that uses Lucene for searching. The search space is in the thousands of documents. Searching against these thousands, I get only a few results, around 20 (which is OK and expected).

However, when I reduce my search space to just those 20 entries (i.e. I index only those 20 entries and disregard everything else, to make development easier), I get the same 20 results but in a different order (and with different scores).

I tried disabling the norm factors via Field#setOmitNorms(true), but I still get different results.

What could be causing the difference in the scoring?

Thanks

Answer 1:

Please see the scoring documentation in Lucene's Similarity API. My bet is on the difference in idf between the two cases (both numDocs and docFreq are different). In order to know for sure, use the explain() function to debug the scores.
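To make the idf argument concrete, here is a minimal sketch in plain Java (no Lucene dependency). It assumes the idf formula of Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)); newer Lucene versions and other Similarity implementations use different formulas. The class name, the index sizes, and the document frequencies are hypothetical numbers chosen for illustration.

```java
public class IdfDemo {

    // Assumed formula: idf of Lucene's classic TF-IDF similarity,
    // 1 + ln(numDocs / (docFreq + 1)). Other Similarity classes differ.
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical full index: 5000 documents.
        // termA occurs in 20 of them, termB in 2000 of them.
        int fullIndexSize = 5000;
        double fullA = idf(20, fullIndexSize);
        double fullB = idf(2000, fullIndexSize);

        // Hypothetical reduced index: only the 20 hits are indexed,
        // and both terms occur in every one of them.
        int smallIndexSize = 20;
        double smallA = idf(20, smallIndexSize);
        double smallB = idf(20, smallIndexSize);

        System.out.printf("full index:  idf(termA)=%.3f idf(termB)=%.3f%n", fullA, fullB);
        System.out.printf("small index: idf(termA)=%.3f idf(termB)=%.3f%n", smallA, smallB);
    }
}
```

With these made-up numbers, termA weighs far more than termB in the full index (roughly 6.5 versus 1.9), while in the reduced index both terms get the same weight. A document that ranks high mainly because it contains the rare term many times can therefore fall behind once the two terms count equally.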

Edit: A code fragment for getting explanations:

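// Assumes searcher, query, searchFilter, and max are already set up by the application.
TopDocs hits = searcher.search(query, searchFilter, max);
ScoreDoc[] scoreDocs = hits.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
  // explain() breaks the score of this hit down into its factors (tf, idf, norms, ...).
  String explanation = searcher.explain(query, scoreDoc.doc).toString();
  Log.debug(explanation); // Log stands for whatever logger the application uses
}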

Answer 2:

Scoring depends on all the documents in the index:

In general, the idea behind the Vector Space Model (VSM) is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.

Source: Apache Lucene - Scoring
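For reference, the practical scoring function described in the documentation of Lucene's classic TF-IDF similarity has roughly the following shape. This is written from that documentation as a sketch; the exact factors vary between Lucene versions, so check it against the version in use.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\,\cdot\,\mathrm{queryNorm}(q)\,\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d)\,\cdot\,\mathrm{idf}(t)^{2}\,\cdot\,\mathrm{boost}(t)\,\cdot\,\mathrm{norm}(t,d) \Big)
```

Omitting norms only neutralizes the norm(t,d) factor. The idf(t) factor still depends on the number of documents in the index and on how many of them contain the term, and queryNorm(q) is itself derived from the idf-weighted query terms, so scores can still differ between the full and the reduced index.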

Source: https://stackoverflow.com/questions/1742124/different-lucene-search-results-using-different-search-space-size

Tags

java

lucene

size

scoring