Question
First of all, I have already seen the Lucene documentation that tells us not to produce scores as percentages:
People frequently want to compute a "Percentage" from Lucene scores to determine what is a "100% perfect" match vs a "50%" match. This is also sometimes called a "normalized score".
Don't do this.
Seriously. Stop trying to think about your problem this way, it's not going to end well.
Because of these recommendations, I used another way to solve my problem.
However, there are a few points in Lucene's argument where I don't really understand why they would be problematic in some cases.
For the case described in that post, I can easily understand why it is bad: if a user does a search and sees the following results:
- ProductA : 5 stars
- ProductB : 2 stars
- ProductC : 1 star
If ProductA is deleted after this first search, the user will be surprised the next time they come back and see the following results:
- ProductB : 5 stars
- ProductC : 3 stars
So, this problem is exactly what Lucene's doc is pointing out.
Now, let's take another example.
Imagine we have an e-commerce website that uses 'classic search' combined with phonetic search. The phonetic search is there to avoid as many empty result sets caused by spelling mistakes as possible. The scores of the phonetic results are very low relative to the scores of the classic search.
In this case, the first idea was to only return results that score at least 10% of the maximum score (sketched below). Results under this threshold would not be considered relevant for us, even if they come from the classic search.
If I do that, I don't have the problem described in the post above, because if a document is deleted it seems logical that the old second product becomes the first one, and the user will not be very surprised (it is the same behavior as if I had kept the score as a float value).
Furthermore, if the scores of the phonetic search are very low, as we expect, we keep the same behavior of only returning relevant results.
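For concreteness, here is a minimal sketch of the cutoff I have in mind, assuming plain Lucene with an already-built IndexSearcher and Query (the class and method names are mine, and the 0.10 ratio is just the 10% figure under discussion):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RelativeScoreCutoff {

    // Keep only hits scoring at least `ratio` (e.g. 0.10f) of the best score in this result set.
    static List<ScoreDoc> keepRelative(IndexSearcher searcher, Query query,
                                       int maxHits, float ratio) throws IOException {
        TopDocs top = searcher.search(query, maxHits);   // hits come back sorted by score
        List<ScoreDoc> kept = new ArrayList<>();
        if (top.scoreDocs.length == 0) {
            return kept;
        }
        float best = top.scoreDocs[0].score;             // maximum score of this particular search
        for (ScoreDoc hit : top.scoreDocs) {
            if (hit.score >= ratio * best) {             // drop everything under 10% of the max
                kept.add(hit);
            }
        }
        return kept;
    }
}
```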
So my questions are: is it always bad to normalize scores, as the Lucene documentation claims? Is my example an exception, or is it a bad idea to do this even in my case?
Answer 1:
The problem is, how do you determine your cutoff, and what does it mean?
Might be easier to look at an example. Say I'm trying to look for people by last name. I'm going to search for:
- "smithfield"
And I have the following documents that I think are all a pretty good match:
- smithfield - an exact match
- smithfielde - Pretty close, sound alike, only one (silent) letter off
- smythfield - Pretty close, sound alike, one vowel changed
- smithfelt - Couple letters off, but still close and sound alike
- snithfield - Not quite a soundalike, but only one letter off. Maybe a typo.
- smittfield - Again, don't quite soundalike, maybe typo or misspelling
- smythfelt - Spelling a fair bit off, but could be a mishearing
- smithfieldings - Identical prefix
So, I've got four kinds of match I need to cover: the exact match should be guaranteed the highest score, and we also want prefix, fuzzy, and sound-alike matches. So let's search for:
smithfield smithfield* smithfield~2 metaphone:sm0flt
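For reference, a query along these lines can be parsed and run with Lucene's classic QueryParser. The following is only a sketch, not necessarily the exact setup behind the numbers below; it assumes an index directory named index, a name field with standard analysis, and a metaphone field indexed with a phonetic (Metaphone) analyzer.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class NameSearch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Default field is "name"; the metaphone: clause targets a field assumed to be
            // indexed with a phonetic filter that stores lowercase Metaphone codes.
            QueryParser parser = new QueryParser("name", new StandardAnalyzer());
            Query query = parser.parse("smithfield smithfield* smithfield~2 metaphone:sm0flt");

            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("name") + " ::: " + hit.score);
            }
        }
    }
}
```

The boosted variant tried further down (smithfield^4 smithfield*^2 ...) parses the same way, since the classic query syntax supports per-clause boosts.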
Results
- smithfield ::: 2.3430576
- smithfielde ::: 0.97367656
- smythfield ::: 0.5657166
- smithfelt ::: 0.50767094
< 10% - Not displayed
- snithfield ::: 0.2137136
- smittfield ::: 0.2137136
- smythfelt ::: 0.0691447
- smithfieldings ::: 0.041700535
I thought smithfieldings was a pretty good match, but it's nowhere even close to making the cut! It's less than 2% of the maximum, never mind 10%! Okay, so let's try boosting:
smithfield^4 smithfield*^2 smithfield~2 metaphone:sm0flt
Results
- smithfield ::: 2.8812196
- smithfielde ::: 0.5907072
- smythfield ::: 0.30413133
< 10% - Not displayed
- smithfelt ::: 0.2729258
- snithfield ::: 0.11489322
- smittfield ::: 0.11489322
- smithfieldings ::: 0.044836726
- smythfelt ::: 0.037172448
That's even worse!
And in production the problem will be worse still. In the real world, you may be dealing with long, complex queries and full-text documents. Field length, repetition of matches, coordination factors, boosts, and numerous query terms: all of it factors into the score.
It's really not all that unusual to see the first result score an order of magnitude higher than the second, even though the second is still a meaningful, interesting result. There isn't any guarantee of an even distribution of scores, so we don't know what the 10% figure means. And Lucene's scoring algorithm tends to err on the side of making the differences nice and big.
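If you want to see exactly which of those factors is driving a particular score on your own data, Lucene will break the number down for you. A small sketch, with the searcher and query assumed to exist already:

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

import java.io.IOException;

public class WhyThatScore {

    // Print the scoring breakdown (term frequencies, field norms, boosts, ...) for the top hits.
    static void explainTopHits(IndexSearcher searcher, Query query) throws IOException {
        TopDocs top = searcher.search(query, 5);
        for (ScoreDoc hit : top.scoreDocs) {
            Explanation why = searcher.explain(query, hit.doc);  // full, nested breakdown of the score
            System.out.println("doc " + hit.doc + " ::: " + hit.score);
            System.out.println(why);                             // Explanation prints as readable text
        }
    }
}
```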
Is it always bad? I'd say yes. As I see it, there are always two better options.
1 - Control your result set with good queries. If you construct your query well, then the query itself will provide the cutoff of your results: not because of some arbitrary cutoff in score, but because the unwanted documents won't be scored at all (see the sketch after this list).
2 - If you don't want to do that, do you really gain anything by cutting off results at that arbitrary point? Users are pretty good at recognizing when search results have gone off the deep end. A user not being able to find what they want is a serious annoyance. Showing too many results is usually a non-issue as long as they are ordered well.
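One way to read option 1 for the phonetic-fallback scenario in the question (the two-pass approach and the field names here are my own illustration, not something this answer prescribes): run the strict query first, and only fall back to the phonetic field when it finds nothing, so the weak phonetic matches never enter the result set in the first place.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

import java.io.IOException;

public class TwoPassSearch {

    // Let the query shape the result set: the phonetic fallback is only used when the
    // stricter query finds nothing, so its weak matches never need a score cutoff.
    static TopDocs search(IndexSearcher searcher, String term) throws IOException {
        // `term` is assumed to be already lowercased/analyzed to match the indexed tokens.
        BooleanQuery strict = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("name", term)), BooleanClause.Occur.SHOULD)
                .add(new PrefixQuery(new Term("name", term)), BooleanClause.Occur.SHOULD)
                .add(new FuzzyQuery(new Term("name", term), 2), BooleanClause.Occur.SHOULD)
                .build();                                  // at least one SHOULD clause must match
        TopDocs hits = searcher.search(strict, 10);
        if (hits.scoreDocs.length > 0) {
            return hits;                                   // good matches: the phonetic field is never queried
        }
        // Fallback: a "metaphone" field assumed to hold phonetic codes produced at index time.
        return searcher.search(new TermQuery(new Term("metaphone", encode(term))), 10);
    }

    // Placeholder for a phonetic encoder (e.g. commons-codec Metaphone); an assumption, not Lucene API.
    static String encode(String s) {
        return s;
    }
}
```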
Answer 2:
The Lucene score values are, as you've covered, only relevant for expressing the relative strength of each match within a set of matches. Out of the context of a particular set of search results, the score for a particular record has no absolute meaning.
For this reason, the only appropriate normalization of the scores would be one to normalize the relationships between relevancy of documents within a result set, and even then you'll want to be very careful about how you employ this information.
Consider this result set, where we examine the score of each record as compared to the immediately preceding result:
- ProductA (let's pretend the score is 10)
- ProductB: 97% (9.7)
- ProductC: 8.5% (0.82)
- ProductD: 100% (0.82)
- ProductE: 100% (0.82)
- ProductF: 24% (0.2)
In this case, the first two results have very similar scores, while the next three have the same score but trail significantly. These numbers are clearly not to be shared with the shoppers online, but the low relative scores at ProductC and ProductF represent sharp enough drops that you could use them to inform other display options. Maybe ProductA and ProductB get displayed in a larger font than the others. If only one product appears before a precipitous drop, it could get even more special highlighting.
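A minimal sketch of that idea: walk the score-ordered hits, compare each score to the one just above it, and record the positions of the sharp drops. The 50% threshold here is illustrative only; picking a sensible value is the hard part.

```java
import java.util.ArrayList;
import java.util.List;

public class ScoreDrops {

    // Return the positions of results scoring less than `ratio` of the result just above them,
    // i.e. the "precipitous drops" that could drive display choices.
    static List<Integer> findDrops(float[] scores, float ratio) {
        List<Integer> drops = new ArrayList<>();
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] < ratio * scores[i - 1]) {
                drops.add(i);
            }
        }
        return drops;
    }

    public static void main(String[] args) {
        // The ProductA..ProductF scores from the example above.
        float[] scores = {10f, 9.7f, 0.82f, 0.82f, 0.82f, 0.2f};
        System.out.println(findDrops(scores, 0.5f));  // [2, 5]: ProductC and ProductF trail sharply
    }
}
```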
I would caution against completely suppressing relatively lower scored results in this kind of search. As you've already proven in your example, relative scores may be misleading, and unless your relevancy is very finely tuned the most relevant documents may not always be the most appropriate. It will do you no good if the desired results are dropped due to a single record that happens to repeat the search terms enough times to win a stellar score, and this is a real threat.
For example, "Hamilton Beach Three-In-One Convection Toaster Oven" will match one in eight words against a search for toaster, while "ToastMaster Toast Toaster Toasting Machine TOASTER" will match as many as five in seven words, depending on how you index. (Both product names are completely made up, but I wanted the second one to look less reputable.)
Also, all returned documents are matches, no matter how low their scores might be. Sometimes a low-ranked result is the dark-horse find that the user really wants. Users will not understand that there are matching documents beyond what they see unless you tell them, so you might hide the trailing results on "page 2", or behind a cut, but you probably don't want to block them. Letting the user understand the size of their result set can also help them decide how to fine-tune their search. Using the significant drops in score as thresholds for paging could be very interesting, but probably a challenging implementation.
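If you did want to experiment with using those drops as page boundaries, the mechanical part is small; the challenge the answer alludes to is deciding what counts as a significant drop. A sketch that reuses the drop positions from the previous example:

```java
import java.util.ArrayList;
import java.util.List;

public class DropPaging {

    // Split a score-ordered result list into "pages" that break at the sharp drops
    // (the drop positions could come from findDrops() in the previous sketch).
    static <T> List<List<T>> pageAtDrops(List<T> results, List<Integer> drops) {
        List<List<T>> pages = new ArrayList<>();
        int start = 0;
        for (int drop : drops) {
            pages.add(results.subList(start, drop));
            start = drop;
        }
        pages.add(results.subList(start, results.size()));
        return pages;
    }
}
```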
Source: https://stackoverflow.com/questions/29674709/solr-scores-as-percentages