Lucene: exact matches aren't shown first

Submitted by …衆ロ難τιáo~ on 2020-01-03 05:49:07

Question


I am using the demo IndexFiles and SearchFiles classes from the org.apache.lucene.demo package to index and search.

My issue is that when I use a query containing more than one word, I don't get the exact matches first. For instance:

Enter query:
"natural language"
Searching for: "natural language"
298 total matching documents
1. download\researchers.uq.edu.au\fields-of-research\natural-language-processing.txt
2. download\researchers.uq.edu.au\research-project\16267.txt
3. download\researchers.uq.edu.au\research-project\16279.txt
4. download\researchers.uq.edu.au\research-project\18361.txt
5. download\www.uq.edu.au\news\%3Farticle%3D2187.txt
6. download\researchers.uq.edu.au\researcher\2115.txt
7. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody%3Fpage%3D1.txt
8. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody%3Fpage%3D2.txt
9. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody.txt
10. download\www.ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody.txt
Press (n)ext page, (q)uit or enter number to jump to a page.

does not give the same results as:

Enter query:
natural language
Searching for: natural language
54307 total matching documents
1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D190.txt
2. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D576.txt
3. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D46.txt
4. download\espace.library.uq.edu.au\view\UQ%3A166163.txt
5. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D108.txt
6. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D70.txt
7. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D708.txt
8. download\researchers.uq.edu.au\fields-of-research\natural-language-processing.txt
9. download\researchers.uq.edu.au\research-project\16267.txt
10. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D117.txt
Press (n)ext page, (q)uit or enter number to jump to a page.

For instance, the first matching document does not even contain the keyword "language".

If I use the explain() method of the IndexSearcher class, I get this result for the first hit:

1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D190.txt
0.70643383 = (MATCH) sum of:
  0.5590494 = (MATCH) weight(contents:natural in 62541) [DefaultSimilarity], result of:
    0.5590494 = score(doc=62541,freq=4.0 = termFreq=4.0
), product of:
      0.8091749 = queryWeight, product of:
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.18300149 = queryNorm
      0.6908882 = fieldWeight in 62541, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.078125 = fieldNorm(doc=62541)
  0.1473844 = (MATCH) weight(contents:language in 62541) [DefaultSimilarity], result of:
    0.1473844 = score(doc=62541,freq=1.0 = termFreq=1.0
), product of:
      0.5875679 = queryWeight, product of:
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.18300149 = queryNorm
      0.25083807 = fieldWeight in 62541, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.078125 = fieldNorm(doc=62541)

If I go to the next page, I find a result such as this:

19. download\www.uq.edu.au\news\%3Farticle%3D2187.txt
0.47449595 = (MATCH) sum of:
  0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of:
    0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0
), product of:
      0.8091749 = queryWeight, product of:
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.18300149 = queryNorm
      0.3454441 = fieldWeight in 35173, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.0390625 = fieldNorm(doc=35173)
  0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of:
    0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0
), product of:
      0.5875679 = queryWeight, product of:
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.18300149 = queryNorm
      0.33182758 = fieldWeight in 35173, product of:
        2.6457512 = tf(freq=7.0), with freq of:
          7.0 = termFreq=7.0
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.0390625 = fieldNorm(doc=35173)

This page does contain the exact phrase "natural language". So my questions are:

1) Why doesn't Lucene show exact matches first?

2) Why does Lucene show a result that does not even contain one of the keywords?

3) Where and how can I change this so that exact matches are shown first, followed by the other relevant results?


Answer 1:


1 - It isn't intended to. See the documentation on Lucene query syntax. The query natural language is made up of two separate terms, and Lucene on its own has no preference for those terms being close together. If you want exact matches, a phrase query is the correct approach, like "natural language".

2 - Both results for which you included an explanation do contain matches for both terms. See:
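As a rough illustration of the difference, here is a toy sketch in plain Java. It is not Lucene's implementation: the class TermVsPhrase, its whitespace-style tokenizer, and the sample documents are all made up for the example, and it assumes the query parser's default behaviour of treating unquoted terms as optional (OR) clauses.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model only (hypothetical class, not part of Lucene).
class TermVsPhrase {

    // Crude stand-in for an analyzer: lowercase, split on non-word characters.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
    }

    // Unquoted query: the document matches if it contains ANY of the terms.
    static boolean matchesAnyTerm(String doc, String... terms) {
        List<String> docTokens = tokens(doc);
        return Arrays.stream(terms).anyMatch(docTokens::contains);
    }

    // Quoted query: the terms must appear next to each other, in order.
    static boolean matchesPhrase(String doc, String... terms) {
        return Collections.indexOfSubList(tokens(doc), Arrays.asList(terms)) >= 0;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Natural language processing research",
            "The natural sciences and the language of mathematics",
            "A natural history museum"
        };
        for (String doc : docs) {
            System.out.println(doc
                + " | any term: " + matchesAnyTerm(doc, "natural", "language")
                + " | phrase: " + matchesPhrase(doc, "natural", "language"));
        }
    }
}
```

All three sample documents match the unquoted form, but only the first matches the phrase, which is consistent with the quoted query returning far fewer hits (298) than the unquoted one (54307).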

0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of:
  0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0
...
0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of:
  0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0

According to Lucene, it found the term "natural" 4 times in that document and "language" 7 times, in the contents field (which I assume is your default field).

3 - Look over the query parser syntax to see what makes the most sense to you. It sounds like you might find Proximity Searches useful.

If you simply want phrase matches first, followed by the others, you could use something along the lines of:
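The two explain() outputs quoted in the question can also be recomputed by hand, which shows why document 62541 ends up above 35173 even though it has fewer "language" hits. The sketch below uses the formulas as they appear in the explain() output for DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), score = queryWeight * fieldWeight). The class and method names are made up for the example, and the constants are copied from the output above.

```java
// Hypothetical helper that redoes the arithmetic printed by explain().
class ExplainArithmetic {

    static final double QUERY_NORM = 0.18300149; // queryNorm from the explain output
    static final long MAX_DOCS = 401502;         // maxDocs from the explain output

    // idf = 1 + ln(maxDocs / (docFreq + 1))
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // score = queryWeight * fieldWeight
    //       = (idf * queryNorm) * (sqrt(freq) * idf * fieldNorm)
    static double termScore(double freq, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = Math.sqrt(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Sum of the two term scores for one document.
    static double docScore(double freqNatural, double freqLanguage, double fieldNorm) {
        double idfNatural = idf(13111, MAX_DOCS);  // docFreq of "natural"
        double idfLanguage = idf(44012, MAX_DOCS); // docFreq of "language"
        return termScore(freqNatural, idfNatural, QUERY_NORM, fieldNorm)
             + termScore(freqLanguage, idfLanguage, QUERY_NORM, fieldNorm);
    }

    public static void main(String[] args) {
        // doc 62541: natural x4, language x1, fieldNorm 0.078125
        System.out.println("doc 62541: " + docScore(4, 1, 0.078125));
        // doc 35173: natural x4, language x7, fieldNorm 0.0390625
        System.out.println("doc 35173: " + docScore(4, 7, 0.0390625));
    }
}
```

The totals agree with the 0.70643383 and 0.47449595 reported by explain(), up to float rounding. The deciding factor is fieldNorm: 0.078125 for document 62541 is twice the 0.0390625 of document 35173, which under DefaultSimilarity usually means a shorter contents field, and that doubling outweighs the extra occurrences of "language".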

"natural language" natural language
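A small sketch of how such a query string could be assembled in code before it is handed to the query parser. The class PhraseFirstQuery and its methods are hypothetical helpers, not part of Lucene; the ^ boost and ~ slop suffixes are standard query parser syntax, but the specific boost and slop values are assumptions to tune. The helper also assumes the terms are plain words with no query-syntax special characters, so it does no escaping.

```java
// Hypothetical helper: builds query strings for the classic query parser syntax.
class PhraseFirstQuery {

    // "t1 t2"^boost t1 t2 : the exact phrase (boosted) plus the individual terms.
    static String phraseThenTerms(int phraseBoost, String... terms) {
        String joined = String.join(" ", terms);
        return "\"" + joined + "\"^" + phraseBoost + " " + joined;
    }

    // "t1 t2"~slop t1 t2 : a proximity (sloppy phrase) clause plus the individual terms.
    static String proximityThenTerms(int slop, String... terms) {
        String joined = String.join(" ", terms);
        return "\"" + joined + "\"~" + slop + " " + joined;
    }

    public static void main(String[] args) {
        System.out.println(phraseThenTerms(2, "natural", "language"));
        System.out.println(proximityThenTerms(5, "natural", "language"));
    }
}
```

Even without the boost, documents containing the phrase match one more clause than documents that only contain the separate terms, so they tend to score higher; the boost just makes that preference stronger.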


Source: https://stackoverflow.com/questions/19217634/lucene-exact-matches-arent-shown-first