How is NaN handled in Pearson correlation user-user similarity matrix in a recommender system?


Question


I am generating a user-user similarity matrix from user-rating data (specifically the MovieLens 100K data). Computing the correlation leads to some NaN values. I have tested this on a smaller dataset:

User-Item rating matrix

   I1 I2 I3 I4
U1 4  0  5  5  
U2 4  2  1  0  
U3 3  0  2  4  
U4 4  4  0  0  

User-User Pearson Correlation similarity matrix

              U1        U2        U3       U4      U5
U1             1        -1         0      -nan  0.755929
U2            -1         1         1      -nan -0.327327
U3             0         1         1      -nan  0.654654
U4          -nan      -nan      -nan      -nan      -nan
U5      0.755929 -0.327327  0.654654      -nan         1

For computing the Pearson correlation, only corated items are considered between two users (see Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Gediminas Adomavicius and Alexander Tuzhilin).

How can I handle the NaN values?

EDIT: Here is the code with which I compute the Pearson correlation in R. The matrix R is the user-item rating matrix, containing ratings on a 1-to-5 scale, where 0 means not rated. S is the user-user correlation matrix.

  for (i in 1:nrow(R)) {
    cat("user: ", i, "\n")
    for (k in 1:nrow(R)) {
      if (i != k) {
        # items rated (non-zero) by both user i and user k
        corated_list <- which(R[i, ] != 0 & R[k, ] != 0)
        # deviations from each user's mean over the corated items
        ui <- R[i, corated_list] - mean(R[i, corated_list])
        uk <- R[k, corated_list] - mean(R[k, corated_list])
        temp <- sum(ui * uk) / sqrt(sum(ui^2) * sum(uk^2))
        S[i, k] <- ifelse(is.nan(temp), 0, temp)
      } else {
        S[i, k] <- 0
      }
    }
  }

Note that in the line S[i, k] <- ifelse(is.nan(temp), 0, temp) I am replacing the NaNs with 0.


Answer 1:


I recently developed a recommender system in Java for user-user and user-item matrices. Firstly, as you have probably already found, recommender systems are difficult. For my implementation I used the Apache Commons Math library, which is fantastic; you are using R, which is probably fairly similar in how it calculates Pearson's.

Your question was how to handle NaN values, followed by an edit saying that you are treating NaN as 0.

My answer is this:

You shouldn't really treat NaN values as 0, because doing so says there is absolutely no correlation between users or between users and items. That might sometimes be the case, but it is likely not always the case, and treating it as such will skew your recommendations.

Firstly you should be asking yourself, "why am I getting NaN values?" Here are some reasons from the Wikipedia page on NaN detailing why you might get a NaN value:

There are three kinds of operations that can return NaN:

  1. Operations with a NaN as at least one operand.

  2. Indeterminate forms: the divisions 0/0 and ±∞/±∞; the multiplications 0 × ±∞ and ±∞ × 0; the additions ∞ + (−∞), (−∞) + ∞, and equivalent subtractions. (The standard also has alternative functions for powers: the standard pow function and the integer-exponent pown function define 0^0, 1^∞, and ∞^0 as 1, while the powr function treats all three indeterminate forms as invalid operations and so returns NaN.)

  3. Real operations with complex results, for example: the square root of a negative number; the logarithm of a negative number; the inverse sine or cosine of a number that is less than −1 or greater than +1.
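
These cases are easy to reproduce directly at the R prompt, and none of them depend on your data:

  0/0          # NaN: the indeterminate form 0/0
  Inf - Inf    # NaN: the equivalent of ∞ + (−∞)
  0 * Inf      # NaN: 0 × ±∞
  sqrt(-1)     # NaN (with a warning): real operation with a complex result
  log(-1)      # NaN (with a warning): logarithm of a negative number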

You should debug your application and step through each step to see which of the above reasons is the offending cause.

Secondly, understand that Pearson's correlation can be written in a number of different ways. You need to consider whether you are calculating it across a sample or a population and then use the appropriate form, e.g. for a sample:

cor(X, Y) = Σ[(x_i − E(X))(y_i − E(Y))] / [(n − 1) s(X) s(Y)]

where E(X) is the mean of the X values, E(Y) is the mean of the Y values, and s(X), s(Y) are the sample standard deviations. The standard deviation is the positive square root of the variance, and the sample variance is sum((x_i − mean)^2) / (n − 1), where mean is the sample mean and n is the number of sample observations.
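
As a quick sanity check, the sample formula above can be computed by hand in R and compared with the built-in cor(); the two vectors here are just made-up ratings:

  x <- c(4, 2, 1)
  y <- c(3, 2, 4)
  sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
  cor(x, y)   # same value: sd() in R already uses the n - 1 form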

This is probably where your NaNs are appearing, i.e. dividing by 0 for "not rated". If you can, I would suggest not using the value 0 to mean "not rated"; use null (NA in R) instead. I would do this for two reasons: (1) the 0 is probably what is cocking up your results with NaNs, and (2) readability/understandability. Your scale is 1 to 5, so 0 should not feature; it confuses things. So avoid that if possible.
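
If you do recode "not rated" as NA, base R already knows how to skip missing values, and you can even get the corated-items-only Pearson matrix in one call. A minimal sketch, assuming Rna is an NA-coded copy of your rating matrix R (users as rows):

  Rna <- R
  Rna[Rna == 0] <- NA                                 # 0 meant "not rated", so make it missing
  mean(Rna[1, ], na.rm = TRUE)                        # user 1's mean over rated items only
  S <- cor(t(Rna), use = "pairwise.complete.obs")     # user-user Pearson over corated items

Note that cor() with use = "pairwise.complete.obs" still returns NA for pairs with too few corated items, which is arguably the honest answer.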

Thirdly, from a recommender standpoint, think about things from a recommendation point of view. If you have 2 users and they only have 1 rating in common, say U1 and U4 for I1 in your smaller dataset, is that 1 item in common really enough to offer recommendations on? The answer is, of course, not. So may I also suggest you set a minimum threshold of ratingsInCommon to ensure the quality of recommendation is better. The minimum you can set for this threshold is 2, but consider setting it a bit higher. If you read the MovieLens research, they set it to somewhere between 5 and 10 (I can't remember off the top of my head). The higher you set this, the less coverage you will get, but you will achieve "better" (lower error score) recommendations. If you have done your reading of the academic literature, you will probably have picked up on this point already, but I thought I would mention it anyway.

On the above point: look at U4 and compare with every other user. Notice how U4 does not have more than 1 item in common with any user. Now, hopefully, you will notice that the NaNs appear exclusively with U4. If you have followed this answer, you will hopefully now see that the reason you are getting NaNs is that you cannot actually compute Pearson's with just 1 item in common: a single rating always equals its own mean, so both deviations are zero and the formula divides 0 by 0. :)
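
You can watch the 0/0 happen using the exact expressions from the code in the question. For U1 versus U4 the only corated item is I1, which both users rated 4:

  ui <- 4 - mean(4)                               # 0: a single rating always equals its own mean
  uk <- 4 - mean(4)                               # 0
  sum(ui * uk) / sqrt(sum(ui^2) * sum(uk^2))      # 0/0 = NaN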

Finally, one thing that slightly bothers me about the sample dataset above is the number of correlations that are 1s and -1s. Think about what that is actually saying about these users' preferences, then sense-check them against the actual ratings. E.g. look at the U1 and U2 ratings: for Item 1 they agree exactly (both rated it 4), then for Item 3 they disagree strongly (U1 rated it 5, U2 rated it 1). It seems strange that the Pearson correlation between these two users is -1 (i.e. that their preferences are completely opposite). This is clearly not the case; really the Pearson score should be a bit above or a bit below 0. This issue links back to the points about using 0 on the scale and comparing only a small number of items.
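
There is also a mechanical reason for those ±1s: with exactly two corated items, the two rating points always lie on a straight line, so Pearson's is forced to be exactly +1 or -1 regardless of how far apart the ratings are. For U1 versus U2, the corated items are I1 and I3:

  cor(c(4, 5), c(4, 1))   # exactly -1: any two points lie on a line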

Now, there are strategies for "filling in" items that users have not rated. I am not going to go into them here; you will need to read up on that, but essentially they amount to something like using the average score for that item or the average rating for that user. Both methods have their downsides, and personally I don't really like either of them. My advice is to only calculate Pearson correlations between users when they have 5 or more items in common, and to ignore the items where the rating is 0 (or, better, null).
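
For completeness, here is a minimal sketch of the item-mean flavour of that fill-in idea, reusing the hypothetical NA-coded matrix Rna from above; it is an illustration of the strategy, not an endorsement of it:

  R_imp <- Rna
  for (j in 1:ncol(R_imp)) {
    item_mean <- mean(R_imp[, j], na.rm = TRUE)   # average of the ratings item j did receive
    R_imp[is.na(R_imp[, j]), j] <- item_mean      # fill the gaps with it
  }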

So, to conclude:

  1. NaN does not equal 0, so do not set it to 0.
  2. 0s in your scale are better represented as null (NA).
  3. You should only calculate Pearson correlations when the number of items in common between two users is > 1, preferably 5-10 or more.
  4. Only calculate the Pearson correlation between two users over their commonly rated items; do not include items in the score that have not been rated by the other user (a sketch combining all four points follows).
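
Putting all four points together, a hedged sketch of how the loop from the question might be revised (R and S as in the question; min_corated is an assumed threshold you would tune in the 5-10 range):

  min_corated <- 5                                 # assumed threshold, tune to taste
  for (i in 1:nrow(R)) {
    for (k in 1:nrow(R)) {
      if (i == k) { S[i, k] <- 0; next }
      corated <- which(R[i, ] != 0 & R[k, ] != 0)  # point 4: corated items only
      if (length(corated) < min_corated) {
        S[i, k] <- NA                              # points 1-3: unknown, not 0
        next
      }
      ui <- R[i, corated] - mean(R[i, corated])
      uk <- R[k, corated] - mean(R[k, corated])
      denom <- sqrt(sum(ui^2) * sum(uk^2))
      S[i, k] <- if (denom > 0) sum(ui * uk) / denom else NA
    }
  }

On the 4-user sample above this produces all NAs, because no pair of users has 5 corated items; that is itself the point: the sample is too small to recommend from.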

Hope that helps and good luck.



Source: https://stackoverflow.com/questions/11429604/how-is-nan-handled-in-pearson-correlation-user-user-similarity-matrix-in-a-recom
