Does calculating correlation between two dataframes require a loop?

臣服心动 2020-12-11 02:15

I have a set of large dataframes that look like A and B:

A <- data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))

  A1 B1 C1
1  1  6 11
2          


        
3 answers
  •  北海茫月
    2020-12-11 02:46

    cor accepts two data.frames:

    A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
    
    B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
    
    cor(A,B)
    
    #           A2        B2       C2
    # A1 0.9481224 0.9190183 0.459588
    # B1 0.9481224 0.9190183 0.459588
    # C1 0.9481224 0.9190183 0.459588
    
    diag(cor(A,B))
    #[1] 0.9481224 0.9190183 0.4595880
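
    The diagonal holds the matched pairs (A1 with A2, B1 with B2, C1 with C2), so the same numbers can also be computed one column pair at a time, which is what the `mapply` and `Map` variants in the benchmarks do. A minimal sketch, reusing the example data from above, that checks the three approaches agree:

    ```r
    A <- data.frame(A1 = c(1, 2, 3, 4, 5), B1 = c(6, 7, 8, 9, 10), C1 = c(11, 12, 13, 14, 15))
    B <- data.frame(A2 = c(6, 7, 7, 10, 11), B2 = c(2, 1, 3, 8, 11), C2 = c(1, 5, 16, 7, 8))

    # Full 3 x 3 correlation matrix, then keep only the matched pairs
    r_diag <- diag(cor(A, B))

    # Column by column: cor(A$A1, B$A2), cor(A$B1, B$B2), cor(A$C1, B$C2)
    r_mapply <- mapply(cor, A, B)
    r_map <- unlist(Map(cor, A, B))

    # The results agree up to floating-point error; names are dropped because
    # mapply() and Map() label the result with the column names of A
    stopifnot(isTRUE(all.equal(unname(r_diag), unname(r_mapply))),
              isTRUE(all.equal(unname(r_diag), unname(r_map))))
    ```

    Neither variant needs an explicit `for` loop; `mapply` and `Map` iterate over the columns internally.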
    

    Edit:

    Here are some benchmarks:

    Unit: microseconds
                       expr     min       lq   median       uq      max neval
            diag(cor(A, B)) 230.292 238.4225 243.0115 255.0295  352.955   100
          mapply(cor, A, B) 267.076 281.5120 286.8030 299.5260  375.087   100
     unlist(Map(cor, A, B)) 250.053 259.1045 264.5635 275.9035 1146.140   100
    

    Edit2:

    And some better benchmarks, using random data with 10 rows and n columns:

    set.seed(42)
    A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
    B <- as.data.frame(matrix(rnorm(10*n),ncol=n))
    

    [Figure: log10 of the median run time in µs of the three approaches, plotted against log2(n)]

    However, I should probably mention that these benchmarks strongly depend on the number of rows.
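
    The number of columns matters too: `cor(A, B)` computes all n × n correlations even though only the n diagonal entries are kept. A vectorised alternative that computes just the matched pairs could look like the sketch below. `colwise_cor` is a hypothetical helper, not part of the original answer or of base R, and it assumes both data frames are numeric, have the same dimensions, and contain no missing values or constant columns.

    ```r
    # Hypothetical helper (an assumption, not from the original answer):
    # Pearson correlation of column i of A with column i of B, for all i at once.
    colwise_cor <- function(A, B) {
      stopifnot(identical(dim(A), dim(B)))
      # scale() centres each column and divides it by its standard deviation
      a <- scale(as.matrix(A))
      b <- scale(as.matrix(B))
      # Sum of products of standardised columns, divided by (rows - 1), is Pearson's r
      colSums(a * b) / (nrow(a) - 1)
    }

    set.seed(42)
    n <- 8  # number of columns, an arbitrary choice for illustration
    A <- as.data.frame(matrix(rnorm(10 * n), ncol = n))
    B <- as.data.frame(matrix(rnorm(10 * n), ncol = n))

    r_fast <- colwise_cor(A, B)
    r_ref <- diag(cor(A, B))

    # Both give the same correlations up to floating-point error
    stopifnot(isTRUE(all.equal(unname(r_fast), unname(r_ref))))
    ```

    Whether this is actually faster than `diag(cor(A, B))` for a given data size is not shown here and would need to be benchmarked.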

    Edit3: Since I was asked for the benchmarking code, here it is.

    library(microbenchmark)

    b <- sapply(2^(1:12), function(n) {
        set.seed(42)
        A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
        B <- as.data.frame(matrix(rnorm(10*n),ncol=n))

        # summary() returns the timings as a data frame, here in microseconds
        res <- summary(microbenchmark(
                         diag(cor(A,B)),
                         mapply(cor, A, B),
                         unlist(Map(cor,A,B)),
                         times=10
                       ),unit="us")
        # one median per benchmarked expression
        res$median
    })
    
    b <- t(b)
    
    matplot(x=1:12,log10(b),type="l",
            ylab="log10(median [µs])", 
            xlab="log2(n)",col=1:3,lty=1)
    legend("topleft", legend=c("diag(cor(A, B))", 
                               "mapply(cor, A, B)",
                               "unlist(Map(cor,A,B))"),lty=1, col=1:3)
    
