Apache Mahout + Euclidean Distance: Unexpected Results

Posted by 醉酒当歌 on 2019-12-11 05:59:02

Question


I'm using Mahout's EuclideanDistanceSimilarity class to rank the similarity of several users, given the following data set of user preferences. Preferences are currently integers from 1 to 5 inclusive. However, I control the scale, so it can change if that would help.

User preferences:
User    Item 1    Item 2    Item 3    Item 4    Item 5    Item 6
 1       2         4         3         5         1         2
 2       5         1         5         1         5         1
 3       1         5         1         5         1         5
 4       2         4         3         5         1         2
 5       3         3         4         5         2         2

I'm getting unexpected results when I run the following test code, which I added to the Test class found here: http://www.massapi.com/source/mahout-distribution-0.4/core/src/test/java/org/apache/mahout/cf/taste/impl/similarity/EuclideanDistanceSimilarityTest.java.html

@Test
public void testSimple2() throws Exception {
    // One row of preferences per user ID; the columns are items 1 through 6.
    DataModel dataModel = getDataModel(
            new long[]{1, 2, 3, 4, 5},
            new Double[][]{
                {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
                {5.0, 1.0, 5.0, 1.0, 5.0, 1.0},
                {1.0, 5.0, 1.0, 5.0, 1.0, 5.0},
                {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
                {3.0, 3.0, 4.0, 5.0, 2.0, 2.0}});
    // Build the similarity once and reuse it for every pair of users.
    EuclideanDistanceSimilarity similarity = new EuclideanDistanceSimilarity(dataModel);
    for (int i = 1; i <= 5; i++) {
        for (int j = 1; j <= 5; j++) {
            System.out.println(i + "," + j + ": " + similarity.userSimilarity(i, j));
        }
    }
}

It produces the following results:

1,1: 1.0
1,2: 0.7129109430106292
1,3: 1.0
1,4: 1.0
1,5: 1.0
2,1: 0.7129109430106292
2,2: 1.0
2,3: 0.5556605665978556
2,4: 0.7129109430106292
2,5: 0.8675434911352263
3,1: 1.0
3,2: 0.5556605665978556
3,3: 1.0
3,4: 1.0
3,5: 0.9683428667784535
4,1: 1.0
4,2: 0.7129109430106292
4,3: 1.0
4,4: 1.0
4,5: 1.0
5,1: 1.0
5,2: 0.8675434911352263
5,3: 0.9683428667784535
5,4: 1.0
5,5: 1.0

Would someone please help me understand what I'm doing wrong here? Clearly, user 1's preferences are not identical to those of users 3 and 5, so why do I get 1.0 for the similarity?

I'm open to using a different algorithm if Euclidean won't work. However, Pearson doesn't work for me, because I need to handle users who submit identical preferences for every item, and I do not want to correct for "grade inflation."


Answer 1:


It is a little weird, but I can explain what's happening.

The Euclidean distance d can't be used as a similarity metric directly, since it gets bigger as the users get less similar. You could use 1/d, but then a perfect match (d = 0) gives infinity, not 1. You can use 1/(1+d) instead, which gives 1 for a perfect match.
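To make the 1/(1+d) mapping concrete, here is a minimal sketch in plain Java. It is not Mahout code; the class and method names are made up for illustration, and it assumes both users have rated every item.

```java
public class InverseDistanceSketch {
    // Euclidean distance between two equal-length preference vectors.
    static double distance(double[] a, double[] b) {
        double sumSquaredDiff = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumSquaredDiff += diff * diff;
        }
        return Math.sqrt(sumSquaredDiff);
    }

    // Hypothetical helper: maps a distance in [0, infinity) to a similarity in (0, 1].
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        double[] user1 = {2, 4, 3, 5, 1, 2};
        double[] user3 = {1, 5, 1, 5, 1, 5};
        System.out.println(similarity(user1, user1)); // identical vectors: d = 0, so 1.0
        System.out.println(similarity(user1, user3)); // different vectors: strictly below 1.0
    }
}
```

With this plain mapping, only identical vectors reach 1.0; any difference at all pulls the value below 1.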

The problem is that the distance can only be calculated over the dimensions (items) that both users have in common. More shared dimensions typically means more distance. So 1/(1+d) penalizes overlap, the opposite of what you'd expect.

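So the formula is really n/(1+d), where n is the number of dimensions of overlap. In some cases that results in a similarity greater than 1, which is capped back to 1. For users 1 and 3 in your data, for example, n = 6 and d = sqrt(15) ≈ 3.87, so n/(1+d) ≈ 1.23, which is reported as 1.0.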
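The n/(1+d) behavior with the cap at 1 can be sketched in plain Java as follows. This is an illustration of the formula as described in this answer, not Mahout's actual source; the class name is hypothetical, and it assumes every user rated every item, so n is simply the vector length.

```java
public class OverlapScaledSketch {
    // Sketch of the described formula: n / (1 + d), capped at 1.0.
    // Assumption: both vectors cover the same items, so n = a.length.
    static double similarity(double[] a, double[] b) {
        double sumSquaredDiff = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumSquaredDiff += diff * diff;
        }
        int n = a.length;
        double raw = n / (1.0 + Math.sqrt(sumSquaredDiff));
        return Math.min(raw, 1.0); // values above 1 are capped back to 1
    }

    public static void main(String[] args) {
        // Preference rows for users 1 through 5, copied from the question.
        double[][] prefs = {
            {2, 4, 3, 5, 1, 2},
            {5, 1, 5, 1, 5, 1},
            {1, 5, 1, 5, 1, 5},
            {2, 4, 3, 5, 1, 2},
            {3, 3, 4, 5, 2, 2},
        };
        for (int i = 0; i < prefs.length; i++) {
            for (int j = 0; j < prefs.length; j++) {
                System.out.println((i + 1) + "," + (j + 1) + ": "
                        + similarity(prefs[i], prefs[j]));
            }
        }
    }
}
```

On the question's data this gives roughly 0.713 for users 1 and 2 and 0.556 for users 2 and 3, while users 1 and 3 come out at 1.0 only because of the cap.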

n is not the right factor; it's an old, simple kludge. I will ask on the mailing list about a more correct expression. For large data sets, though, this tends to work OK.



Source: https://stackoverflow.com/questions/7821944/apache-mahout-euclidean-distance-unexpected-results
