Similarity between two data sets or arrays

Question

Let's say I have a dataset that looks like this:

{A:1, B:3, C:6, D:6}

I also have a list of other sets to compare my specific set against:

{A:1, B:3, C:6, D:6},  
{A:2, B:3, C:6, D:6},  
{A:99, B:3, C:6, D:6},  
{A:5, B:1, C:6, D:9},  
{A:4, B:2, C:2, D:6}

My entries could be visualized as a table with four columns: A, B, C, and D.

How can I find the set with the most similarity? For this example, row 1 is a perfect match and row 2 is a close second, while row 3 is quite far away.

I am thinking of calculating a simple delta, for example: Abs(a1 - a2) + Abs(b1 - b2) + ... for each column, and perhaps getting a correlation value for the entries with the best deltas.

Is this a valid way? And what is the name of this problem?

Answer 1:

Yes, that should work fairly well.

In mathematical terms, it would be: ∑_{x ∈ (a,b,c,d)} Abs(x₁ - x₂)
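This sum of absolute differences is commonly called the Manhattan (or L1) distance. A minimal Python sketch of ranking the question's rows by it follows; the names `manhattan_distance`, `target`, and `candidates` are made up for illustration, and it assumes every row has the same keys.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences over the keys of `a` (assumed shared by `b`)."""
    return sum(abs(a[k] - b[k]) for k in a)


# The specific set and the rows to compare it against, taken from the question.
target = {"A": 1, "B": 3, "C": 6, "D": 6}
candidates = [
    {"A": 1, "B": 3, "C": 6, "D": 6},
    {"A": 2, "B": 3, "C": 6, "D": 6},
    {"A": 99, "B": 3, "C": 6, "D": 6},
    {"A": 5, "B": 1, "C": 6, "D": 9},
    {"A": 4, "B": 2, "C": 2, "D": 6},
]

# One distance per row; a smaller distance means a more similar row.
distances = [manhattan_distance(target, c) for c in candidates]

# Index of the most similar row (0-based, so 0 is "row 1" in the question).
best_index = min(range(len(candidates)), key=distances.__getitem__)

print(distances)   # [0, 1, 98, 9, 8]
print(best_index)  # 0
```

The result matches the intuition in the question: row 1 is a perfect match, row 2 is a close second, and row 3 is far away.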

A ratio-based (relative) difference might be a better idea, depending on whether a gap should count for less when the values themselves are large.

Consider something like 1000000, 5, 5, 5 vs 999995, 5, 5, 5 and 1000000, 0, 5, 5.

According to the above formula, the first would have the same similarity to both the second and the third (a distance of 5 in each case).

If this is not desired (as 999995 can be considered pretty close to 1000000, while 0 can be thought of as quite far from 5), you should divide by the maximum of the two when calculating each distance.

∑_{x ∈ (a,b,c,d)} [ Abs(x₁ - x₂) / max(x₁, x₂) ]

This will put every term between 0 and 1 (assuming the values are non-negative and not both zero), so each term is the relative difference between the two values.

This means that, for our above example, we'd consider 1000000, 5, 5, 5 and 999995, 5, 5, 5 to be very similar (since the above sum will be |1000000-999995|/1000000 + 0 + 0 + 0 = 0.000005) and 1000000, 5, 5, 5 and 1000000, 0, 5, 5 will be considered much more different (since the sum will be 0 + |5-0|/5 + 0 + 0 = 1).
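The relative version can be sketched in Python as below. The function name `relative_distance` is hypothetical, the values are assumed to be non-negative, and skipping a pair where both values are 0 (to avoid dividing by zero) is an assumption of this sketch, not something the formula above specifies.

```python
def relative_distance(xs, ys):
    """Sum of |x - y| / max(x, y) over paired values (assumed non-negative)."""
    total = 0.0
    for x, y in zip(xs, ys):
        largest = max(x, y)
        if largest == 0:
            # Assumption: both values are 0, so treat them as identical
            # instead of dividing by zero.
            continue
        total += abs(x - y) / largest
    return total


first = (1000000, 5, 5, 5)
second = (999995, 5, 5, 5)
third = (1000000, 0, 5, 5)

near = relative_distance(first, second)  # about 0.000005
far = relative_distance(first, third)    # 1.0

print(near < far)  # True
```

With the plain sum of absolute differences both pairs would score 5; here the first pair scores close to 0 and the second scores 1.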

Answer 2:

Your problem reminds me of finding a Hamming distance. Basically, the Hamming distance between two objects is the number of elements in one object that must be changed to make it match the other object. There are similar measures as well (Damerau–Levenshtein distance, Euclidean distance, etc.).

You have a number of choices in how you implement this. For instance, is the distance between {1,3,4} and {1,7,4} 1 (because one element changed) or 4 (because of the magnitude of the change)? How you actually define the distance depends a lot on the context of your problem, and there's not necessarily a right answer.
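To make the two options concrete, here is a small Python sketch that computes both for the example above. The function names `hamming_distance` and `magnitude_distance` are illustrative only, and the inputs are treated as ordered sequences of equal length.

```python
def hamming_distance(xs, ys):
    """Number of positions at which the two sequences differ."""
    return sum(x != y for x, y in zip(xs, ys))


def magnitude_distance(xs, ys):
    """Sum of the absolute size of the change at each position."""
    return sum(abs(x - y) for x, y in zip(xs, ys))


a = (1, 3, 4)
b = (1, 7, 4)

changed = hamming_distance(a, b)     # one element changed
amount = magnitude_distance(a, b)    # it changed by 4

print(changed, amount)  # 1 4
```

Counting changed positions suits categorical values, where "how far apart" has no meaning; summing magnitudes suits numeric values like those in the question.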

Source: https://stackoverflow.com/questions/19815335/similarity-between-two-data-sets-or-arrays

Tags

algorithm

correlation

similarity