Question
I'm using Mahout's EuclideanDistanceSimilarity class to rank the similarity of several users given the following data set of user preferences. Preferences are currently integers from 1 to 5 inclusive. However, I control the scale, so it can change if that would help.
User Preferences:
User  Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
1     2       4       3       5       1       2
2     5       1       5       1       5       1
3     1       5       1       5       1       5
4     2       4       3       5       1       2
5     3       3       4       5       2       2
I'm getting unexpected results when I run the following test code, which I added to the Test class found here: http://www.massapi.com/source/mahout-distribution-0.4/core/src/test/java/org/apache/mahout/cf/taste/impl/similarity/EuclideanDistanceSimilarityTest.java.html
@Test
public void testSimple2() throws Exception {
    DataModel dataModel = getDataModel(
        new long[]{1, 2, 3, 4, 5},
        new Double[][]{
            {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
            {5.0, 1.0, 5.0, 1.0, 5.0, 1.0},
            {1.0, 5.0, 1.0, 5.0, 1.0, 5.0},
            {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
            {3.0, 3.0, 4.0, 5.0, 2.0, 2.0},});
    for (int i = 1; i <= 5; i++) {
        for (int j = 1; j <= 5; j++) {
            System.out.println(i + "," + j + ": " + new EuclideanDistanceSimilarity(dataModel).userSimilarity(i, j));
        }
    }
}
It produces the following results:
1,1: 1.0
1,2: 0.7129109430106292
1,3: 1.0
1,4: 1.0
1,5: 1.0
2,1: 0.7129109430106292
2,2: 1.0
2,3: 0.5556605665978556
2,4: 0.7129109430106292
2,5: 0.8675434911352263
3,1: 1.0
3,2: 0.5556605665978556
3,3: 1.0
3,4: 1.0
3,5: 0.9683428667784535
4,1: 1.0
4,2: 0.7129109430106292
4,3: 1.0
4,4: 1.0
4,5: 1.0
5,1: 1.0
5,2: 0.8675434911352263
5,3: 0.9683428667784535
5,4: 1.0
5,5: 1.0
Would someone please help me understand what I'm doing wrong here? Clearly, user 1's preferences are not identical to those of users 3 and 5, so why do I get 1.0 for the similarity?
I'm open to using a different algorithm if Euclidean won't work. However, Pearson doesn't work for me because I need to handle users that submit identical preferences for each item, and I do not want to correct for "grade inflation."
Answer 1:
It is a little weird, but I can explain what's happening.
The Euclidean distance d can't be used as a similarity metric directly, since it gets bigger with "less similarity". You could use 1/d, but then perfect matches result in infinity, not 1. Using 1/(1+d) instead maps a perfect match (d = 0) to exactly 1.
The problem is that the distance can only be calculated over dimensions that both users have in common. More dimensions typically means more distance. So it's penalizing overlap, the opposite of what you'd expect.
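To make the overlap penalty concrete, here is a minimal sketch in plain Java (no Mahout involved). The class name OverlapPenalty and the two made-up users, who disagree by exactly 1 on every shared item, are assumptions for illustration only. The distance grows with the number of shared items, so a plain 1/(1+d) shrinks as the overlap grows.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Mahout.
final class OverlapPenalty {
    // Euclidean distance over the items both users rated (here: all of them).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // The plain 1/(1+d) mapping from distance to similarity.
    static double naiveSimilarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        // Two made-up users who differ by exactly 1 on each of n shared items.
        for (int n = 1; n <= 6; n++) {
            double[] a = new double[n];
            double[] b = new double[n];
            Arrays.fill(a, 3.0);
            Arrays.fill(b, 4.0);
            System.out.println(n + " shared items: d=" + distance(a, b)
                    + ", 1/(1+d)=" + naiveSimilarity(a, b));
        }
    }
}
```

The per-item disagreement is the same in every row, yet the score keeps dropping as n increases, which is the effect the extra factor tries to offset.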
So the formula is really n/(1+d), where n is the number of dimensions of overlap. In some cases that gives a value greater than 1, which is capped back to 1. For example, users 1 and 5 have d = 2 over n = 6 items, so 6/(1+2) = 2, which is reported as 1.0.
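The n/(1+d) formula with the cap can be checked against the numbers in the question using a small standalone sketch. This is plain Java written for illustration, not Mahout's actual source; the class and method names are hypothetical, and it assumes every user rated all six items, so n = 6 for every pair.

```java
// Hypothetical standalone re-implementation of the formula described above.
final class EuclideanSketch {
    // Euclidean distance over the shared items (all six in this data set).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // n / (1 + d), before any capping; n is the number of shared items.
    static double rawSimilarity(double[] a, double[] b) {
        return a.length / (1.0 + distance(a, b));
    }

    // The same value capped so it never exceeds 1.0.
    static double similarity(double[] a, double[] b) {
        return Math.min(1.0, rawSimilarity(a, b));
    }

    public static void main(String[] args) {
        // Preferences of users 1..5 from the question.
        double[][] prefs = {
            {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
            {5.0, 1.0, 5.0, 1.0, 5.0, 1.0},
            {1.0, 5.0, 1.0, 5.0, 1.0, 5.0},
            {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
            {3.0, 3.0, 4.0, 5.0, 2.0, 2.0},
        };
        for (int i = 0; i < prefs.length; i++) {
            for (int j = 0; j < prefs.length; j++) {
                System.out.println((i + 1) + "," + (j + 1)
                        + ": raw=" + rawSimilarity(prefs[i], prefs[j])
                        + " capped=" + similarity(prefs[i], prefs[j]));
            }
        }
    }
}
```

Printing the raw value next to the capped one shows which pairs really are identical (users 1 and 4, where d = 0) and which only hit 1.0 because of the cap (for instance users 1 and 3).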
n is not the right factor. It's an old simple kludge. I will ask on the mailing list about the right-er expression. For large data, this tends to work OK though.
Source: https://stackoverflow.com/questions/7821944/apache-mahout-euclidean-distance-unexpected-results