How can I correlate against multiple columns using ddply?

Submitted by 感情迁移 on 2019-12-13 12:42:45

Question


I have a data.frame and I want to calculate correlation coefficients using one column against the other columns (there are some non-numeric columns in the frame as well).

ddply(Banks,.(brand_id,standard.quarter),function(x) { cor(BLY11,x) })
# Error in cor(BLY11, x) : 'y' must be numeric

I tested against is.numeric(x)

ddply(Banks,.(brand_id,standard.quarter),function(x) { if (is.numeric(x)) cor(BLY11,x) else 0 })

but every comparison failed, so it returned 0, and the result had only one column, as if the function were being called only once. What is being passed to the function? I'm just coming to R and I think there's something fundamental I'm missing.

Thanks


Answer 1:


Try something like this:

cor(longley[, 1], longley[ , sapply(longley, is.numeric)])



     GNP.deflator       GNP Unemployed Armed.Forces Population      Year  Employed
[1,]            1 0.9915892  0.6206334    0.4647442  0.9791634 0.9911492 0.9708985



Answer 2:


From ?cor:

If ‘x’ and ‘y’ are matrices then the covariances (or correlations) between the columns of ‘x’ and the columns of ‘y’ are computed.

So your only real job is to remove the non-numeric columns:

# An example data.frame containing a non-numeric column
d <- cbind(fac=c("A","B"), mtcars)

## Calculate correlations between the mpg column and all numeric columns
cor(d$mpg, as.matrix(d[sapply(d, is.numeric)]))
     mpg       cyl       disp         hp      drat         wt     qsec
[1,]   1 -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
            vs        am      gear       carb
[1,] 0.6640389 0.5998324 0.4802848 -0.5509251

Edit: And in fact, as @MYaseen208's answer shows, there's no need to explicitly convert data.frames to matrices. Both of the following work just fine:

cor(d$mpg, d[sapply(d, is.numeric)])

cor(mtcars, mtcars)



Answer 3:


This function operates on a chunk:

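calc_cor_only_numeric = function(chunk) {
   # logical vector: TRUE for each numeric column of this chunk
   is_numeric = sapply(chunk, is.numeric)
   # keep only the numeric columns, then compute their correlation matrix
   return(cor(chunk[is_numeric]))
 }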

And can be used by ddply:

ddply(banks, .(cat1, cat2), calc_cor_only_numeric)

I could not check the code, but this should get you started.
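The per-chunk idea can be tried end to end on built-in data. The following is a minimal sketch, not the asker's actual setup: it assumes plyr is installed, uses mtcars in place of Banks, groups by am, and correlates mpg against the remaining numeric columns. The helper name cor_vs_mpg is hypothetical. The grouping column is dropped before calling cor because it is constant within each chunk and would only produce NA.

```r
library(plyr)  # assumption: plyr is installed

# Hypothetical helper: correlate mpg with every other numeric column of a chunk
cor_vs_mpg <- function(chunk) {
  # keep only the numeric columns of this chunk
  num <- chunk[sapply(chunk, is.numeric)]
  # drop the grouping column (constant within a chunk) and mpg itself
  num <- num[setdiff(names(num), c("am", "mpg"))]
  # cor() returns a 1 x k matrix; turn it into a one-row data.frame for ddply
  as.data.frame(cor(chunk$mpg, num))
}

# One row of correlations per value of am
res <- ddply(mtcars, .(am), cor_vs_mpg)
print(res)
```

Returning a one-row data.frame lets ddply stack the per-group results and prepend the grouping column.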




Answer 4:


ddply splits a data.frame into chunks and sends them (smaller data.frames) to your function. Your x is a data.frame with the same columns as Banks. Thus, is.numeric(x) is FALSE, while is.data.frame(x) should return TRUE.

Try:

function(x) { 
  cor(x$BLY11, x$othercolumnname) 
}
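To see what the function actually receives, a small check like the one below may help. It is only a sketch: it assumes plyr is installed and uses the built-in mtcars data grouped by cyl instead of Banks, and the names chunk_info and pair_cor are illustrative.

```r
library(plyr)  # assumption: plyr is installed

# Report, for each group, what kind of object the function is handed
chunk_info <- ddply(mtcars, .(cyl), function(x) {
  data.frame(is_df = is.data.frame(x),
             is_num = is.numeric(x),
             n_rows = nrow(x))
})
print(chunk_info)

# Because x is a data.frame, individual columns are reached with x$name
pair_cor <- ddply(mtcars, .(cyl), function(x) {
  data.frame(mpg_wt = cor(x$mpg, x$wt))
})
print(pair_cor)
```

The function is called once per group, and each call sees only that group's rows.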



Answer 5:


It looks like what you're doing can be done with sapply as well:

with(Banks,
  sapply( list(brand_id,standard.quarter), function(x) cor(BLY11,x) )
)


Source: https://stackoverflow.com/questions/12182105/how-can-correlate-against-multiple-columns-using-ddply
