Lucene search by URL

Submitted by 六眼飞鱼酱① on 2019-12-22 10:10:29

Question


I'm storing a Document which has a URL field:

Document doc = new Document();
doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("html", CompressionTools.compressString(html), Field.Store.YES));

I'd like to be able to find a Document by its URL, but I get 0 results:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
Query query = new QueryParser(LUCENE_VERSION, "url", analyzer).parse(url);
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// Display results
for (ScoreDoc hit : hits) {
  System.out.println("FOUND A MATCH");
}
searcher.close();

What can I do differently so that I can store an HTML document and find it by its URL?


Answer 1:


You can build the query directly as a TermQuery instead of going through the QueryParser, something like this:

Query query = new TermQuery(new Term("url", url));

Suggestion:

I suggest wrapping the TermQuery in a BooleanQuery: it performs well, it is optimized internally, and it lets you add further clauses later.

TermQuery tq = new TermQuery(new Term("url", url));
// BooleanClause.Occur.SHOULD: use this operator for clauses that should appear in the matching documents.
BooleanQuery bq = new BooleanQuery();
bq.add(tq, BooleanClause.Occur.SHOULD); // add() returns void, so it cannot be chained onto the constructor
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
// Search with the BooleanQuery built above
searcher.search(bq, collector);

I see you are indexing the URL field as NOT_ANALYZED, which IMO is good for searching: since no analyzer is used, the value is stored as a single term.
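To make the "single term" point concrete, here is a minimal sketch in plain Java, with no Lucene involved. The class `ToyTermIndex` and its methods are hypothetical names made up for illustration, and `crudeTokenize` is only a rough stand-in for an analyzer (it is not how StandardAnalyzer actually tokenizes). It shows that when the whole URL is stored as one term, only a lookup with the whole URL finds it, while tokens produced from the URL do not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index for illustration only; this is NOT Lucene's implementation.
class ToyTermIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Store the whole value as one term (roughly the effect of NOT_ANALYZED).
    void addNotAnalyzed(int docId, String value) {
        postings.computeIfAbsent(value, k -> new ArrayList<>()).add(docId);
    }

    // Crude stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    static List<String> crudeTokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Exact term lookup, analogous in spirit to a TermQuery.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        String url = "http://example.com/page?id=1";
        index.addNotAnalyzed(0, url);

        // Looking up the whole URL finds the document.
        System.out.println("whole URL -> " + index.lookup(url));

        // Looking up the individual tokens finds nothing.
        for (String token : crudeTokenize(url)) {
            System.out.println(token + " -> " + index.lookup(token));
        }
    }
}
```

This is why a query that has been run through a tokenizing analyzer tends to miss a NOT_ANALYZED field: the query terms and the indexed term are simply different strings.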

Now, if your business case is "I will give you a URL, find the EXACT one in the Lucene index", then you should look at indexing with a different analyzer (KeywordAnalyzer, etc.).




Answer 2:


The Lucene QueryParser is interpreting some of the URL characters (such as ':') as part of the query parser syntax. You can use a TermQuery instead, like so:

TermQuery query = new TermQuery(new Term("url", url));
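To see which characters in a URL collide with the query syntax, here is a small sketch in plain Java. The class `QuerySyntaxCheck`, its methods, and the character list in `SPECIAL` are assumptions for illustration, based on the classic query parser syntax; check the syntax documentation for your Lucene version. Lucene itself provides the static method `QueryParser.escape(String)` for this purpose, so the hand-rolled `escape` here only shows the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hand-rolled check, not part of Lucene.
class QuerySyntaxCheck {
    // Assumed list of characters the classic query syntax treats specially.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Return the special characters found in the text, in order of appearance.
    static List<Character> specialCharsIn(String text) {
        List<Character> found = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                found.add(c);
            }
        }
        return found;
    }

    // Prefix each special character with a backslash.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = "http://example.com/page?id=1";
        System.out.println(specialCharsIn(url)); // the ':' and '?' in the URL
        System.out.println(escape(url));
    }
}
```

Escaping alone may not be enough in the question's setup, since QueryParser still passes the text through the analyzer, and a tokenized URL will not equal the single term stored by a NOT_ANALYZED field. A TermQuery sidesteps both the syntax and the analyzer.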


Source: https://stackoverflow.com/questions/5321388/lucene-search-by-url
