Does calculating correlation between two dataframes require a loop?

Deadly 提交于 2019-11-27 06:43:06

问题


I have a set of large dataframes that look like A and B:

A <- data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))

  A1 B1 C1
1  1  6 11
2  2  7 12
3  3  8 13
4  4  9 14
5  5 10 15

B <- data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))

  A2 B2 C2
1  6  2  1
2  7  1  5
3  7  3 16
4 10  8  7
5 11 11  8

I would like to create a vector (C) that denotes the Pearson correlation between A1 & A2, B1 & B2, and C1 & C2. In this case, for example, those correlations are:

[1] 0.95 0.92 0.46

回答1:


cor accepts two data.frames:

A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))

B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))

cor(A,B)

#           A2        B2       C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588

diag(cor(A,B))
#[1] 0.9481224 0.9190183 0.4595880

Edit:

Here are some benchmarks:

Unit: microseconds
                   expr     min       lq   median       uq      max neval
        diag(cor(A, B)) 230.292 238.4225 243.0115 255.0295  352.955   100
      mapply(cor, A, B) 267.076 281.5120 286.8030 299.5260  375.087   100
 unlist(Map(cor, A, B)) 250.053 259.1045 264.5635 275.9035 1146.140   100

Edit2:

And some better benchmarks using

set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))

However, I should probably mention that these benchmarks strongly depend on the number of rows.

Edit3: Since I was asked for the benchmarking code, here it is.

b <- sapply(2^(1:12), function(n) {
    set.seed(42)
    A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
    B <- as.data.frame(matrix(rnorm(10*n),ncol=n))

    require(microbenchmark)
    res <- print(microbenchmark(
                   diag(cor(A,B)),
                   mapply(cor, A, B),
                   unlist(Map(cor,A,B)),
                   times=10
                 ),unit="us")
    res$median
})

b <- t(b)

matplot(x=1:12,log10(b),type="l",
        ylab="log10(median [µs])", 
        xlab="log2(n)",col=1:3,lty=1)
legend("topleft", legend=c("diag(cor(A, B))", 
                           "mapply(cor, A, B)",
                           "unlist(Map(cor,A,B))"),lty=1, col=1:3)



回答2:


You can use friend of apply functions, Map, for that.

Map(function(x,y) cor(x,y),A,B)
$A1
[1] 0.9481224

$B1
[1] 0.9190183

$C1
[1] 0.459588

If you want the output as vector as suggested by @Jilber :

unlist(Map(function(x,y) cor(x,y),A,B))
       A1        B1        C1 
0.9481224 0.9190183 0.4595880

Or you can just use:

 unlist(Map(cor,A,B))
       A1        B1        C1 
0.9481224 0.9190183 0.459588



回答3:


Another alternative you can use mapply function

> mapply(function(x,y) cor(x,y),A,B)
       A1        B1        C1 
0.9481224 0.9190183 0.4595880 

Or just mapply(cor, A, B) as suggested by @Aaron.



来源:https://stackoverflow.com/questions/18535231/does-calculating-correlation-between-two-dataframes-require-a-loop

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!