Sorted list difference | 易学教程

问题

I have the following problem.

I have a set of elements that I can sort by a certain algorithm A . The sorting is good, but very expensive.

There is also an algorithm B that can approximate the result of A. It is much faster, but the ordering will not be exactly the same.

Taking the output of A as a 'golden standard' I need to get a meaningful estimate of the error resulting of the use of B on the same data.

Could anyone please suggest any resource I could look at to solve my problem? Thanks in advance!

EDIT :

As requested : adding an example to illustrate the case : if the data are the first 10 letters of the alphabet,

A outputs : a,b,c,d,e,f,g,h,i,j

B outputs : a,b,d,c,e,g,h,f,j,i

What are the possible measures of the resulting error, that would allow me to tune the internal parameters of algorithm B to get result closer to the output of A?

回答1:

Spearman's rho

I think what you want is Spearman's rank correlation coefficient. Using the index [rank] vectors for the two sortings (perfect A and approximate B), you calculate the rank correlation rho ranging from -1 (completely different) to 1 (exactly the same):

where d(i) are the difference in ranks for each character between A and B

You can defined your measure of error as a distance D := (1-rho)/2.

回答2:

I would determine the largest correctly ordered sub set.

                               +-------------> I
                               |   +--------->
                               |   |
A -> B -> D ----->  E  -> G -> H --|--> J
     |             ^ |             |    ^
     |             | |             |    |
     +------> C ---+ +-----------> F ---+

In your example 7 out of 10 so the algorithm scores 0.7. The other sets have the length 6. Correct ordering scores 1.0, reverse ordering 1/n.

I assume that this is related to the number of inversions. x + y indicates x <= y (correct order) and x - y indicates x > y (wrong order).

A + B + D - C + E + G + H - F + J - I

We obtain almost the same result - 6 of 9 are correct scorring 0.667. Again correct ordering scores 1.0 and reverse ordering 0.0 and this might be much easier to calculate.

回答3:

Are you looking for finding some algorithm that calculates the difference based on array sorted with A and array sorted with B as inputs? Or are you looking for a generic method of determining on average how off an array would be when sorted with B?

If the first, then I suggest something as simple as the distance each item is from where it should be (an average would do better than a sum to remove length of array as an issue)

If the second, then I think I'd need to see more about these algorithms.

回答4:

It's tough to give a good generic answer, because the right solution for you will depend on your application.

One of my favorite options is just the number of in-order element pairs, divided by the total number of pairs. This is a nice, simple, easy-to-compute metric that just tells you how many mistakes there are. But it doesn't make any attempt to quantify the magnitude of those mistakes.

double sortQuality = 1;
if (array.length > 1) {
   int inOrderPairCount = 0;
   for (int i = 1; i < array.length; i++) {
      if (array[i] >= array[i - 1]) ++inOrderPairCount;
   }
   sortQuality = (double) inOrderPairCount / (array.length - 1);
}

回答5:

Calculating RMS Error may be one of the many possible methods. Here is small python code.

def calc_error(out_A,out_B):
        # in    <= input
        # out_A <= output of algorithm A
        # out_B <= output of algorithm B

        rms_error = 0

        for i in range(len(out_A)):
            # Take square of differences and add
            rms_error +=  (out_A[i]-out_B[i])**2 

        return rms_error**0.5   # Take square root

>>> calc_error([1,2,3,4,5,6],[1,2,3,4,5,6])
0.0
>>> calc_error([1,2,3,4,5,6],[1,2,4,3,5,6]) # 4,3 swapped
1.414
>>> calc_error([1,2,3,4,5,6],[1,2,4,6,3,5]) # 3,4,5,6 randomized
2.44

NOTE: Taking square root is not necessary but taking squares is as just differences may sum to zero. I think that calc_error function gives approximate number of wrongly placed pairs but I dont have any programming tools handy so :(.

Take a look at this question.

回答6:

you could try something involving hamming distance

来源：https://stackoverflow.com/questions/859536/sorted-list-difference

标签

algorithm

language-agnostic

comparison