Lucene: Completely disable weighting, scoring, ranking

*爱你&永不变心* submitted on 2019-12-22 10:35:23

Question


I'm using Lucene to build a big index of token co-occurrences (e.g. [elephant,animal], [melon,fruit], [bmw,car], ...). I query the index with a BooleanQuery to get an absolute count of how often two tokens co-occurred in my index, like so:

// search for documents which contain word+category
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("word", word)), Occur.MUST);
query.add(new TermQuery(new Term("category", category)), Occur.MUST);
// only care about the total number of hits
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, collector);
int count = collector.getTotalHits();

These queries run very frequently, and I'm currently not satisfied with their performance. Profiling showed that the method BooleanQuery#createWeight takes a lot of time. I do not need any scoring or ranking of my results, as I'm interested in absolute document counts only.

Is there a convenient way (e.g. a pre-existing class) to completely disable scoring and weighting? If not, are there any hints as to which classes I need to extend for my use case?


Answer 1:


I'm not quite sure if it will bypass scoring in such a way as to get the performance increase you are looking for, but an easy way to apply a constant score would be to wrap the query in a ConstantScoreQuery, like:

BooleanQuery bq = new BooleanQuery();
//etc.
ConstantScoreQuery query = new ConstantScoreQuery(bq);
searcher.search(query, collector);

I would, however, strongly recommend making use of Filters. Not only do filters bypass scoring, they also cache their results, so your "category" field in particular seems like a very good place for one. The first time you query a category using a filter, it will take longer because the cache for that filter has to be built, but after that you should see a very significant increase in speed. Take a look at the FieldCacheTermsFilter.

Like:

Query query = new TermQuery(new Term("word", word));
Filter filter = new FieldCacheTermsFilter("category", category);
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, filter, collector);
int count = collector.getTotalHits();
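
To illustrate why a cached filter helps here, the following is a toy model in plain Java, not Lucene code. All names in it (CachedFilterSketch, bitsFor, count, builds) are hypothetical, and it assumes each document has a single category and that the word's postings are already available as an array of doc IDs. It only sketches the idea described above: the bit set for a category is built once, on first use, and later counts are a simple membership test with no scoring involved.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy model of a cached filter; an illustration only, not Lucene's implementation. */
public class CachedFilterSketch {
    // Category value per document, indexed by doc ID (stand-in for the "category" field).
    private final String[] categoryByDoc;
    // One bit set per category, built lazily the first time the category is queried.
    private final Map<String, BitSet> cache = new HashMap<>();
    // Number of bit sets built so far (kept only to make the caching visible).
    private int builds = 0;

    public CachedFilterSketch(String[] categoryByDoc) {
        this.categoryByDoc = categoryByDoc;
    }

    // Returns the cached bit set for a category, building it on a cache miss.
    private BitSet bitsFor(String category) {
        return cache.computeIfAbsent(category, c -> {
            builds++;
            BitSet bits = new BitSet(categoryByDoc.length);
            for (int doc = 0; doc < categoryByDoc.length; doc++) {
                if (c.equals(categoryByDoc[doc])) {
                    bits.set(doc);
                }
            }
            return bits;
        });
    }

    /** Counts the docs in wordPostings that also belong to the category. */
    public int count(int[] wordPostings, String category) {
        BitSet bits = bitsFor(category);
        int hits = 0;
        for (int doc : wordPostings) {
            if (bits.get(doc)) {
                hits++;
            }
        }
        return hits;
    }

    public int builds() {
        return builds;
    }

    public static void main(String[] args) {
        String[] categories = {"animal", "fruit", "animal", "car", "animal"};
        CachedFilterSketch filter = new CachedFilterSketch(categories);
        int[] elephantDocs = {0, 1, 4}; // docs containing the word "elephant"
        System.out.println(filter.count(elephantDocs, "animal")); // prints 2
        System.out.println(filter.builds());                      // prints 1
    }
}
```

The first count for a category pays for a full pass over the documents; every later count for the same category reuses the bit set, which mirrors the slow-first-query, fast-afterwards behaviour described above.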



Answer 2:


I had a similar problem and came up with this solution, which is compatible with Lucene 7. (Unfortunately, the FieldCacheTermsFilter class and the search method that accepts a filter are not available in Lucene 7.)

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class UnscoredCollector extends SimpleCollector {
    private final List<Integer> docIds = new ArrayList<>();
    private LeafReaderContext currentLeafReaderContext;

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
        this.currentLeafReaderContext = context;
    }

    @Override
    public boolean needsScores(){
        return false;
    }

    @Override
    public void collect(int localDocId) {
        // localDocId is relative to the current segment; adding docBase makes it index-wide
        docIds.add(currentLeafReaderContext.docBase + localDocId);
    }

    public List<Integer> getDocIds() {
        return docIds;
    }
}

...and then used the collector when searching:

UnscoredCollector collector = new UnscoredCollector();
indexSearcher.search(query, collector);
// the matching doc IDs are now available via collector.getDocIds()
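
The collector above adds docBase to each collected ID because Lucene hands a collector doc IDs that are relative to the current index segment. As a rough sketch of that arithmetic, here is a toy model in plain Java, not Lucene code. The class and method names (DocBaseSketch, globalIds) are hypothetical, and the segment sizes are made-up inputs standing in for the number of document slots in each segment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of segment-local vs. index-wide doc IDs; an illustration only. */
public class DocBaseSketch {
    /**
     * Maps segment-local hits to index-wide IDs.
     * segmentSizes[i] is the number of document slots in segment i;
     * localHitsPerSegment[i] holds the matching local doc IDs in segment i.
     */
    public static List<Integer> globalIds(int[] segmentSizes, int[][] localHitsPerSegment) {
        List<Integer> ids = new ArrayList<>();
        int docBase = 0; // index-wide ID of the current segment's first document
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            for (int localDocId : localHitsPerSegment[seg]) {
                ids.add(docBase + localDocId);
            }
            docBase += segmentSizes[seg]; // the next segment starts where this one ends
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 2};        // two segments with 3 and 2 documents
        int[][] hits = {{0, 2}, {1}}; // local hits in each segment
        System.out.println(globalIds(sizes, hits)); // prints [0, 2, 4]
    }
}
```

If only the number of matches is needed, as in the original question, the offset is irrelevant and incrementing a counter in collect would be enough.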


Source: https://stackoverflow.com/questions/22744858/lucene-completely-disable-weighting-scoring-ranking
