Question
Consider the sample data below:
a=c(NA,1,NA)
b=c(1,2,4)
c=c(0,1,0)
d=c(1,2,4)
df=data.frame(a,b,c,d)
The objective is to find the correlation between two columns, where NA should reduce the correlation. NA means that an event did not take place.
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
> cor(df$a, df$b)
[1] NA
Or should I be looking at some other mathematical function?
Answer 1:
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
Here is a way to use NA values to decrease the correlation. For demonstration, I am using different, somewhat larger data.
a <- sort(runif(10))
b <- sort(runif(10))
## Sorting so that there is some good correlation between them.
## Now making some values NA deliberately
a[c(9,10)] <- NA
cor(a[1:8],b[1:8])
## [1] 0.890465 #correlation value is high
## Let's assign a to c and fill the NA values with something
c <- a
## Using the mean causes no change to the numerator but increases the denominator.
c[is.na(a)] <- mean(a, na.rm=TRUE)
cor(c,b)
## [1] 0.6733387
Note that when you replace all NA terms with the mean of the observed values, the numerator does not change, because each additional term is multiplied by a zero deviation. The denominator, however, picks up additional squared deviations for b, so the magnitude of the correlation comes down. Also, the more NA values in your data, the more the correlation comes down.
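The argument above can be checked numerically. The following is only a sketch: the seed and the names obs, filled, num_obs, num_filled, ss_b_obs and ss_b_all are assumptions introduced for illustration, not part of the original answer. It compares the numerator of Pearson's r and the sum of squared deviations of b before and after mean imputation.

```r
set.seed(42)                        # assumed seed, only for reproducibility
a <- sort(runif(10))
b <- sort(runif(10))
a[c(9, 10)] <- NA                   # mark two observations as missing

obs <- !is.na(a)                    # rows where a was observed
filled <- a
filled[!obs] <- mean(a, na.rm = TRUE)   # mean imputation

# Numerator of Pearson's r: sum of cross-products of deviations
num_obs    <- sum((a[obs] - mean(a[obs])) * (b[obs] - mean(b[obs])))
num_filled <- sum((filled - mean(filled)) * (b - mean(b)))

# Sum of squared deviations of b, which enters the denominator
ss_b_obs <- sum((b[obs] - mean(b[obs]))^2)
ss_b_all <- sum((b - mean(b))^2)

r_obs    <- cor(a[obs], b[obs])     # correlation on observed rows only
r_filled <- cor(filled, b)          # correlation after mean imputation

print(isTRUE(all.equal(num_obs, num_filled)))  # numerators agree up to rounding
print(ss_b_all >= ss_b_obs)                    # denominator term does not shrink
print(abs(r_filled) <= abs(r_obs))             # correlation moves toward zero
```

One consequence worth keeping in mind: the imputation shrinks the correlation toward zero, so a negative correlation would become less negative rather than lower.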
Answer 2:
The question doesn't make mathematical sense as there is no correlation between events that didn't happen. Correlation cannot be reduced by no event happening. There is no function to do this other than to transform the data.
You may replace the NA values with something, as @Ujjwal Kumar has suggested, but this is just data manipulation and not a predefined function.
Look at the help file for cor (?cor). With options like cor(df$a,df$b,use="pairwise.complete.obs") you can see how NA values are usually treated: they are simply removed and have no impact on the correlation itself.
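To make the behaviour of the use argument concrete, here is a small sketch. The vectors x and y are hypothetical data, chosen because the question's column a has only one observed value and so cannot yield a correlation on its own.

```r
x <- c(1, 2, NA, 4, 5)    # hypothetical data, not from the question
y <- c(2, 1, 4, 3, 6)

r_everything <- cor(x, y)                                # default use = "everything"
r_complete   <- cor(x, y, use = "complete.obs")          # drops the row with NA
r_pairwise   <- cor(x, y, use = "pairwise.complete.obs") # same thing for two vectors

# use = "all.obs" refuses to run when values are missing
all_obs_fails <- inherits(try(cor(x, y, use = "all.obs"), silent = TRUE), "try-error")

print(is.na(r_everything))                                # NA propagates by default
print(isTRUE(all.equal(r_complete, r_pairwise)))
print(isTRUE(all.equal(r_complete, cor(x[-3], y[-3]))))   # identical to removing row 3 by hand
print(all_obs_fails)
```

In every non-default option the row with NA is either dropped or rejected, so none of them lets a missing value lower the result.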
From the ?cor help page:
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).
"na.or.complete" is the same unless there are no complete cases, that gives NA. Finally, if use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.
Answer 3:
I guess there is no simple answer. You have to remove the rows where a is NA, along with the corresponding values in columns b, c and d, and then compute the correlation. You can check whether there are corresponding NA values in each column (a, b, c, d).
In your example you can compute the correlation for all combinations of b, c and d, but if you want to compute cor(a,b) you have to pick only the rows without NA in both a and b. And maybe, once you have computed this cor(a,b), multiply it by the number of rows without NA in a and b divided by the number of all rows in the dataset, so that more NA values give a lower result.
a=c(NA,1,NA)
b=c(1,2,4)
c=c(0,1,0)
d=c(1,2,4)
df=data.frame(a,b,c,d)
Source: https://stackoverflow.com/questions/36009475/reducing-correlation-of-datasets-with-na