Does calculating correlation between two dataframes require a loop?

臣服心动 2020-12-11 02:15

I have a set of large dataframes that look like A and B:

A <- data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))

  A1 B1 C1
1  1  6 11
2  2  7 12
3  3  8 13
4  4  9 14
5  5 10 15

3 Answers
  • 2020-12-11 02:43

    Another alternative is to use the mapply function:

    > mapply(function(x,y) cor(x,y),A,B)
           A1        B1        C1 
    0.9481224 0.9190183 0.4595880 
    

    Or just mapply(cor, A, B) as suggested by @Aaron.
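
    If cor needs extra arguments, they can be passed through mapply's MoreArgs. A minimal sketch, assuming you want Spearman's rank correlation instead of the default Pearson (that choice is only an illustration); A and B are the example data frames from the question:

    ```r
    A <- data.frame(A1=c(1,2,3,4,5), B1=c(6,7,8,9,10), C1=c(11,12,13,14,15))
    B <- data.frame(A2=c(6,7,7,10,11), B2=c(2,1,3,8,11), C2=c(1,5,16,7,8))

    # MoreArgs holds arguments that are identical for every column pair;
    # method = "spearman" is an assumed example, not part of the question
    rho <- mapply(cor, A, B, MoreArgs = list(method = "spearman"))
    rho
    ```

    The result is again a named vector with one value per column pair.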

  • 2020-12-11 02:46

    cor accepts two data.frames:

    A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
    
    B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
    
    cor(A,B)
    
    #           A2        B2       C2
    # A1 0.9481224 0.9190183 0.459588
    # B1 0.9481224 0.9190183 0.459588
    # C1 0.9481224 0.9190183 0.459588
    
    diag(cor(A,B))
    #[1] 0.9481224 0.9190183 0.4595880
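
    Note that cor(A, B) computes every pairwise correlation, so for n columns it fills an n-by-n matrix of which only the diagonal is kept. A rough sketch of a vectorised alternative that computes only the matched pairs, assuming complete data (no NAs) and data frames of equal dimensions; colwise_cor is a hypothetical helper name, not a base R function:

    ```r
    A <- data.frame(A1=c(1,2,3,4,5), B1=c(6,7,8,9,10), C1=c(11,12,13,14,15))
    B <- data.frame(A2=c(6,7,7,10,11), B2=c(2,1,3,8,11), C2=c(1,5,16,7,8))

    # Hypothetical helper: Pearson correlation of column i of X with column i of Y.
    # scale() centres each column and divides it by its sample standard deviation,
    # so the column sums of the element-wise product, divided by n - 1, are the
    # correlations of the matched pairs.
    colwise_cor <- function(X, Y) {
      stopifnot(nrow(X) == nrow(Y), ncol(X) == ncol(Y))
      X <- scale(as.matrix(X))
      Y <- scale(as.matrix(Y))
      colSums(X * Y) / (nrow(X) - 1)
    }

    r <- colwise_cor(A, B)
    r
    ```

    This never builds the full matrix, which may matter when the number of columns is large.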
    

    Edit:

    Here are some benchmarks:

    Unit: microseconds
                       expr     min       lq   median       uq      max neval
            diag(cor(A, B)) 230.292 238.4225 243.0115 255.0295  352.955   100
          mapply(cor, A, B) 267.076 281.5120 286.8030 299.5260  375.087   100
     unlist(Map(cor, A, B)) 250.053 259.1045 264.5635 275.9035 1146.140   100
    

    Edit2:

    And some better benchmarks, using random data with 10 rows and n columns:

    set.seed(42)
    A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
    B <- as.data.frame(matrix(rnorm(10*n),ncol=n))
    

    [Figure: log10 of the median runtime in µs against log2(n) for diag(cor(A, B)), mapply(cor, A, B) and unlist(Map(cor, A, B))]

    However, I should probably mention that these benchmarks strongly depend on the number of rows.
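
    To check the row dependence on your own machine, a rough sketch using base system.time() instead of microbenchmark follows. time_approaches is a hypothetical helper, and the sizes (200 columns; 10, 100 and 1000 rows) and the repetition count are arbitrary assumptions:

    ```r
    # Hypothetical helper: elapsed seconds for each approach on random data of
    # the given shape. system.time() is coarse, so each call is repeated.
    time_approaches <- function(n_rows, n_cols, reps = 5) {
      set.seed(42)
      A <- as.data.frame(matrix(rnorm(n_rows * n_cols), ncol = n_cols))
      B <- as.data.frame(matrix(rnorm(n_rows * n_cols), ncol = n_cols))
      elapsed <- function(f) {
        system.time(for (i in seq_len(reps)) f())[["elapsed"]]
      }
      c(diag_cor = elapsed(function() diag(cor(A, B))),
        mapply   = elapsed(function() mapply(cor, A, B)),
        Map      = elapsed(function() unlist(Map(cor, A, B))))
    }

    # Same number of columns, increasing number of rows
    timings <- sapply(c(10, 100, 1000), time_approaches, n_cols = 200)
    colnames(timings) <- c("10 rows", "100 rows", "1000 rows")
    timings
    ```

    Each column of timings holds the elapsed seconds for one row count; the actual numbers will vary between machines and runs.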

    Edit3: Since I was asked for the benchmarking code, here it is.

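    library(microbenchmark)

    # For n = 2, 4, ..., 4096 columns (10 rows each), time the three
    # approaches and keep the median runtime of each in microseconds.
    b <- sapply(2^(1:12), function(n) {
        set.seed(42)
        A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
        B <- as.data.frame(matrix(rnorm(10*n),ncol=n))

        # summary() gives one row per expression with a median column
        res <- summary(microbenchmark(
                         diag(cor(A,B)),
                         mapply(cor, A, B),
                         unlist(Map(cor,A,B)),
                         times=10
                       ),unit="us")
        res$median
    })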
    
    b <- t(b)
    
    matplot(x=1:12,log10(b),type="l",
            ylab="log10(median [µs])", 
            xlab="log2(n)",col=1:3,lty=1)
    legend("topleft", legend=c("diag(cor(A, B))", 
                               "mapply(cor, A, B)",
                               "unlist(Map(cor,A,B))"),lty=1, col=1:3)
    
  • 2020-12-11 02:51

    You can use Map, a relative of the apply family of functions, for that.

    Map(function(x,y) cor(x,y),A,B)
    $A1
    [1] 0.9481224
    
    $B1
    [1] 0.9190183
    
    $C1
    [1] 0.459588
    

    If you want the output as a vector, as suggested by @Jilber:

    unlist(Map(function(x,y) cor(x,y),A,B))
           A1        B1        C1 
    0.9481224 0.9190183 0.4595880
    

    Or you can just use:

    unlist(Map(cor,A,B))
           A1        B1        C1 
    0.9481224 0.9190183 0.4595880
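
    A further variant, offered as a sketch rather than something taken from the answers here: vapply checks that every result is a single number, which unlist(Map(...)) does not. Column pairs are matched by position, and the names are copied from A by assumption:

    ```r
    A <- data.frame(A1=c(1,2,3,4,5), B1=c(6,7,8,9,10), C1=c(11,12,13,14,15))
    B <- data.frame(A2=c(6,7,7,10,11), B2=c(2,1,3,8,11), C2=c(1,5,16,7,8))

    # numeric(1) makes vapply fail loudly if cor ever returns anything
    # other than a single numeric value
    r <- vapply(seq_along(A), function(i) cor(A[[i]], B[[i]]), numeric(1))
    names(r) <- names(A)
    r
    ```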
    