Computing similarity between two lists


Question


EDIT: As everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.

Eg,

1,7,4,5,8,9
1,7,5,4,9,6

What is a good measure of similarity between these two lists, given that order matters? For example, the similarity should be penalized because 4 and 5 are swapped in the two lists.

I have two systems: one state-of-the-art system and one system that I implemented. Given a query, both systems return a ranked list of documents. Now, I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important, as we are talking about ranked systems. Does anyone know of any measures that can help me find the similarity between these two lists?


Answer 1:


DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually good measures for ranked lists.

They give the full gain for a relevant document if it is ranked first, and the gain decreases the lower the document is ranked.

Using DCG/nDCG to evaluate the system compared to the state-of-the-art baseline:

Note: If you set all results returned by the "state of the art system" as relevant, then your system is identical to the state of the art if they received the same rank using DCG/nDCG.

Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)

To further enhance it, you can assign a relevance grade [relevance will not be binary], determined according to how each document was ranked in the state of the art. For example, rel_i = 1/log(1+i) (base-2 logarithm, to match the worked values below) for the document ranked at position i in the state-of-the-art system.

If the value received by this evaluation function is close to 1, your system is very similar to the baseline.

Example:

mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]

First you give a score to each document, according to its rank in the state-of-the-art system [using the formula from above]:

doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222

Now you calculate DCG(stateOfTheArt) using the relevance grades above [note that relevance is not binary here] and get DCG(stateOfTheArt) = 2.1100933062283396.
Next, calculate it for your system using the same relevance weights and get DCG(mySystem) = 1.9784040064803783.

Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
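
For reference, here is a short Python sketch that reproduces these numbers (it assumes base-2 logarithms for both the relevance grades and the DCG discount, which is what the figures above imply):

    import math

    def graded_relevance(state_of_the_art):
        # Relevance grade for each document, based on its rank in the
        # state-of-the-art list: rel = 1 / log2(rank + 1).
        return {doc: 1.0 / math.log2(rank + 1)
                for rank, doc in enumerate(state_of_the_art, start=1)}

    def dcg(ranking, relevance):
        # DCG = sum over positions i of rel_i / log2(i + 1);
        # documents without a grade (not in the state of the art) count as 0.
        return sum(relevance.get(doc, 0.0) / math.log2(i + 1)
                   for i, doc in enumerate(ranking, start=1))

    mySystem = [1, 2, 5, 4, 6, 7]
    stateOfTheArt = [1, 2, 4, 5, 6, 9]

    rel = graded_relevance(stateOfTheArt)
    print(dcg(stateOfTheArt, rel))                       # 2.1100933062283396
    print(dcg(mySystem, rel))                            # 1.9784040064803783
    print(dcg(mySystem, rel) / dcg(stateOfTheArt, rel))  # 0.9375907693942939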




Answer 2:


Kendall's tau is the metric you want. It measures the number of pairwise inversions between the lists. Spearman's footrule does something similar, but measures displacement rather than inversions. Both are designed for the task at hand: measuring the difference between two rank-ordered lists.
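
As an illustration (not part of the original answer), here is a small Python sketch computing Kendall's tau with SciPy and Spearman's footrule by hand, assuming for simplicity that both systems rank the same set of documents:

    from scipy.stats import kendalltau

    # Hypothetical example: both systems rank the same six documents.
    state_of_the_art = [1, 2, 4, 5, 6, 9]
    my_system = [1, 2, 5, 4, 6, 9]

    rank1 = {doc: r for r, doc in enumerate(state_of_the_art, start=1)}
    rank2 = {doc: r for r, doc in enumerate(my_system, start=1)}
    docs = sorted(rank1)

    # Kendall's tau: based on the number of pairwise inversions between the two orders.
    tau, _ = kendalltau([rank1[d] for d in docs], [rank2[d] for d in docs])

    # Spearman's footrule: total displacement, i.e. the sum of absolute rank differences.
    footrule = sum(abs(rank1[d] - rank2[d]) for d in docs)

    print(tau, footrule)  # tau close to 1 and a small footrule both mean "very similar"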




Answer 3:


As you said, you want to compute how similar one list is to the other. As a simple starting point, you can count the number of inversions. There is an O(N log N) divide-and-conquer approach for this. It is a very simple way to measure the "similarity" between two lists.

For example, if you want to compare how 'similar' the music tastes of two people on a music website are, you take their rankings of a set of songs and count the number of inversions between them. The lower the count, the more 'similar' their tastes are.

Since you are already considering the "state of the art system" to be a benchmark of correctness, counting inversions should give you a basic measure of the 'similarity' of your ranking. Of course this is just a starting approach, but you can build on it depending on how strict you want to be about the "inversion gap", etc.

    D1 D2 D3 D4 D5 D6
    -----------------
R1: 1, 7, 4, 5, 8, 9  [Rankings from 'state of the art' system]
R2: 1, 7, 5, 4, 9, 6  [ your Rankings]

Since the rankings are listed in document order, you can write your own comparator function based on R1 (the ranking of the "state of the art" system) and then count the inversions with respect to that comparator.

You can "penalize" 'similarity' for each inversions found: i < j but R2[i] >' R2[j]
( >' here you use your own comparator)
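
For illustration, a minimal Python sketch of this idea: map R2 into R1's order and count inversions with the O(N log N) merge-sort approach, restricted to the documents that appear in both rankings (document 6 only appears in R2):

    def count_inversions(seq):
        # Count pairs (i, j) with i < j but seq[i] > seq[j], via merge sort.
        if len(seq) <= 1:
            return 0, list(seq)
        mid = len(seq) // 2
        left_inv, left = count_inversions(seq[:mid])
        right_inv, right = count_inversions(seq[mid:])
        merged, inversions = [], left_inv + right_inv
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                # left[i] > right[j]: right[j] is inverted with every remaining left element
                inversions += len(left) - i
                merged.append(right[j]); j += 1
        merged.extend(left[i:]); merged.extend(right[j:])
        return inversions, merged

    # Treat the state-of-the-art order (R1) as the reference: re-express R2 as each
    # document's rank under R1, then count how many pairs are out of order.
    r1 = [1, 7, 4, 5, 8, 9]
    r2 = [1, 7, 5, 4, 9, 6]
    rank_in_r1 = {doc: r for r, doc in enumerate(r1, start=1)}
    common = [doc for doc in r2 if doc in rank_in_r1]  # documents ranked by both systems
    inversions, _ = count_inversions([rank_in_r1[doc] for doc in common])
    print(inversions)  # 1 here: only the 4/5 pair is swapped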





Answer 4:


In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al, A Similarity Measure for Indefinite Rankings (2010). Besides containing a good review of existing measures (such as the above-mentioned Kendall tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable to result lists of varying length and to cases where not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of results at the point the user stops reading.

The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
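
If you just want to experiment, here is a minimal sketch of the truncated RBO sum (only the prefix part of the formula; the full method in the paper also extrapolates to the unseen tails of the lists, which is omitted here):

    def rbo_truncated(list1, list2, p=0.9):
        # Truncated Rank-Biased Overlap:
        # (1 - p) * sum over depths d of p**(d - 1) * |prefix overlap at depth d| / d,
        # summed only up to the observed depth (no tail extrapolation).
        depth = min(len(list1), len(list2))
        seen1, seen2 = set(), set()
        overlap = 0
        score = 0.0
        for d in range(1, depth + 1):
            x, y = list1[d - 1], list2[d - 1]
            # Incrementally update the size of the intersection of the two depth-d prefixes.
            if x == y:
                overlap += 1
            else:
                overlap += (x in seen2) + (y in seen1)
            seen1.add(x)
            seen2.add(y)
            score += p ** (d - 1) * overlap / d
        return (1 - p) * score

    print(rbo_truncated([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6], p=0.9))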

Another simple measure is cosine similarity: the cosine between two vectors whose dimensions correspond to items, with inverse ranks as weights. However, it doesn't gracefully handle items that occur in only one of the lists (see the implementation in the link above).

  1. For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
  2. Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
  3. Return v12 / sqrt(v11 * v22)

For your example, this gives a value of 0.7252747.
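
A minimal sketch of the three steps above; note that the exact value you get can depend on implementation details such as how ranks and items missing from one list are weighted:

    import math

    def inverse_rank_cosine(list1, list2):
        # Cosine similarity between two rankings, using 1/rank as the weight of each
        # item and 0 for items that appear in only one of the lists (steps 1-3 above).
        h1 = {item: 1.0 / rank for rank, item in enumerate(list1, start=1)}
        h2 = {item: 1.0 / rank for rank, item in enumerate(list2, start=1)}
        items = set(h1) | set(h2)
        v12 = sum(h1.get(i, 0.0) * h2.get(i, 0.0) for i in items)
        v11 = sum(w * w for w in h1.values())
        v22 = sum(w * w for w in h2.values())
        return v12 / math.sqrt(v11 * v22)

    print(inverse_rank_cosine([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6]))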

Please let me give you some practical advice beyond your immediate question. Unless your 'production system' baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as the above-mentioned nDCG) rather than a similarity; a new ranking will sometimes be better and sometimes worse than the baseline, and you want to know whether the former case happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of, say, 0.72, does this mean the lists are really similar or significantly different? Similarity measures are more helpful for saying that, e.g., a new ranking method 1 is closer to production than another new ranking method 2.




Answer 5:


I suppose you are talking about comparing two Information Retrieval systems, which, trust me, is not trivial. It is a complex computer science problem.

For measuring relevance or doing this kind of A/B testing, you need a couple of things:

  1. A competitor against which to measure relevance. As you have two systems, this prerequisite is met.

  2. You need to manually rate the results. You can ask your colleagues to rate query/url pairs for popular queries, and for the holes (i.e. query/url pairs that were not rated) you can use a dynamic ranking function from a "Learning to Rank" algorithm (http://en.wikipedia.org/wiki/Learning_to_rank). Don't be surprised by that, but it's true (see the Google/Bing example below).

Google and Bing are competitors in the horizontal search market. These search engines employ manual judges around the world and invest millions in them to rate their results for queries. So generally, for each query, the top 3 or top 5 results are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest and most popular metrics.

According to wikipedia:

Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.

Wikipedia explains NDCG very well. It is a short article; please go through it.
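
For concreteness, here is a minimal nDCG sketch using graded relevance and the rel / log2(position + 1) formulation described above (the judge grades in the example are made up for illustration):

    import math

    def dcg(relevances):
        # DCG for a result list given graded relevance per position (top result first).
        return sum(rel / math.log2(pos + 1)
                   for pos, rel in enumerate(relevances, start=1))

    def ndcg(relevances):
        # nDCG: DCG normalized by the ideal DCG (the same grades sorted best-first).
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Hypothetical judge grades (0 = bad ... 3 = perfect) for the top 5 results of a query.
    print(ndcg([3, 2, 3, 0, 1]))  # ~0.97: good, but not the ideal ordering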




Answer 6:


Is the list of documents exhaustive? That is, is every document rank-ordered by system 1 also rank-ordered by system 2? If so, Spearman's rho may serve your purposes. When they don't share the same documents, the big question is how to interpret that result. I don't think there is a measurement that answers that question, although there may be some that implement an implicit answer to it.




Answer 7:


I actually know four different measures for that purpose.

Three have already been mentioned:

  • NDCG
  • Kendall's Tau
  • Spearman's Rho

But if you have more than two rankings to compare, use Kendall's W.
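
A minimal sketch of the standard Kendall's W formula, W = 12S / (m^2 (n^3 - n)), without tie correction; the three example rankings below are made up for illustration:

    def kendalls_w(rankings):
        # Kendall's W (coefficient of concordance) for m complete rankings of the same
        # n items: 1 means perfect agreement, 0 means no agreement.
        # Each ranking is a list of item ids ordered from best to worst.
        m = len(rankings)
        items = sorted(rankings[0])
        n = len(items)
        # Sum, over all rankings, of the rank position each item receives.
        rank_sums = {item: 0 for item in items}
        for ranking in rankings:
            for rank, item in enumerate(ranking, start=1):
                rank_sums[item] += rank
        mean = m * (n + 1) / 2.0
        s = sum((r - mean) ** 2 for r in rank_sums.values())
        return 12.0 * s / (m ** 2 * (n ** 3 - n))

    # Hypothetical example: three systems ranking the same five documents.
    print(kendalls_w([[1, 2, 3, 4, 5],
                      [1, 3, 2, 4, 5],
                      [2, 1, 3, 4, 5]]))  # ~0.89: the three rankings largely agree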



Source: https://stackoverflow.com/questions/9365184/computing-similarity-between-two-lists
