Is there a R function that applies a function to each pair of columns?

二次信任 提交于 2019-11-26 06:38:08

问题


I often need to apply a function to each pair of columns in a dataframe/matrix and return the results in a matrix. Now I always write a loop to do this. For instance, to make a matrix containing the p-values of correlations I write:

df <- data.frame(x=rnorm(100),y=rnorm(100),z=rnorm(100))

n <- ncol(df)

foo <- matrix(0,n,n)

for ( i in 1:n)
{
    for (j in i:n)
    {
        foo[i,j] <- cor.test(df[,i],df[,j])$p.value
    }
}

foo[lower.tri(foo)] <- t(foo)[lower.tri(foo)]

foo
          [,1]      [,2]      [,3]
[1,] 0.0000000 0.7215071 0.5651266
[2,] 0.7215071 0.0000000 0.9019746
[3,] 0.5651266 0.9019746 0.0000000

which works, but is quite slow for very large matrices. I can write a function for this in R (not bothering with cutting time in half by assuming a symmetrical outcome as above):

Papply <- function(x,fun)
{
n <- ncol(x)

foo <- matrix(0,n,n)
for ( i in 1:n)
{
    for (j in 1:n)
    {
        foo[i,j] <- fun(x[,i],x[,j])
    }
}
return(foo)
}

Or a function with Rcpp:

library(\"Rcpp\")
library(\"inline\")

src <- 
\'
NumericMatrix x(xR);
Function f(fun);
NumericMatrix y(x.ncol(),x.ncol());

for (int i = 0; i < x.ncol(); i++)
{
    for (int j = 0; j < x.ncol(); j++)
    {
        y(i,j) = as<double>(f(wrap(x(_,i)),wrap(x(_,j))));
    }
}
return wrap(y);
\'

Papply2 <- cxxfunction(signature(xR=\"numeric\",fun=\"function\"),src,plugin=\"Rcpp\")

But both are quite slow even on a pretty small dataset of 100 variables ( I thought the Rcpp function would be faster, but I guess conversion between R and C++ all the time takes its toll):

> system.time(Papply(matrix(rnorm(100*300),300,100),function(x,y)cor.test(x,y)$p.value))
   user  system elapsed 
   3.73    0.00    3.73 
> system.time(Papply2(matrix(rnorm(100*300),300,100),function(x,y)cor.test(x,y)$p.value))
   user  system elapsed 
   3.71    0.02    3.75 

So my question is:

  1. Due to the simplicity of these functions I assume this is already somewhere in R. Is there an apply or plyr function that does this? I have looked for it but haven\'t been able to find it.
  2. If so, is it faster?

回答1:


It wouldn't be faster, but you can use outer to simplify the code. It does require a vectorized function, so here I've used Vectorize to make a vectorized version of the function to get the correlation between two columns.

df <- data.frame(x=rnorm(100),y=rnorm(100),z=rnorm(100))
n <- ncol(df)

corpij <- function(i,j,data) {cor.test(data[,i],data[,j])$p.value}
corp <- Vectorize(corpij, vectorize.args=list("i","j"))
outer(1:n,1:n,corp,data=df)



回答2:


I'm not sure if this addresses your problem in a proper manner, but take a look at William Revelle's psych package. corr.test returns list of matrices with correlation coefs, # of obs, t-test statistic, and p-value. I know I use it all the time (and AFAICS you're also a psychologist, so it may suite your needs as well). Writing loops is not the most elegant way of doing this.

library(psych)
corr.test(mtcars)
( k <- corr.test(mtcars[1:5]) )
Call:corr.test(x = mtcars[1:5])
Correlation matrix 
       mpg   cyl  disp    hp  drat
mpg   1.00 -0.85 -0.85 -0.78  0.68
cyl  -0.85  1.00  0.90  0.83 -0.70
disp -0.85  0.90  1.00  0.79 -0.71
hp   -0.78  0.83  0.79  1.00 -0.45
drat  0.68 -0.70 -0.71 -0.45  1.00
Sample Size 
     mpg cyl disp hp drat
mpg   32  32   32 32   32
cyl   32  32   32 32   32
disp  32  32   32 32   32
hp    32  32   32 32   32
drat  32  32   32 32   32
Probability value 
     mpg cyl disp   hp drat
mpg    0   0    0 0.00 0.00
cyl    0   0    0 0.00 0.00
disp   0   0    0 0.00 0.00
hp     0   0    0 0.00 0.01
drat   0   0    0 0.01 0.00

str(k)
List of 5
 $ r   : num [1:5, 1:5] 1 -0.852 -0.848 -0.776 0.681 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "mpg" "cyl" "disp" "hp" ...
  .. ..$ : chr [1:5] "mpg" "cyl" "disp" "hp" ...
 $ n   : num [1:5, 1:5] 32 32 32 32 32 32 32 32 32 32 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "mpg" "cyl" "disp" "hp" ...
  .. ..$ : chr [1:5] "mpg" "cyl" "disp" "hp" ...
 $ t   : num [1:5, 1:5] Inf -8.92 -8.75 -6.74 5.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "mpg" "cyl" "disp" "hp" ...
  .. ..$ : chr [1:5] "mpg" "cyl" "disp" "hp" ...
 $ p   : num [1:5, 1:5] 0.00 6.11e-10 9.38e-10 1.79e-07 1.78e-05 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "mpg" "cyl" "disp" "hp" ...
  .. ..$ : chr [1:5] "mpg" "cyl" "disp" "hp" ...
 $ Call: language corr.test(x = mtcars[1:5])
 - attr(*, "class")= chr [1:2] "psych" "corr.test"



回答3:


92% of the time is being spent in cor.test.default and routines it calls so its hopeless trying to get faster results by simply rewriting Papply (other than the savings from computing only those above or below the diagonal assuming that your function is symmetric in x and y).

> M <- matrix(rnorm(100*300),300,100)
> Rprof(); junk <- Papply(M,function(x,y) cor.test( x, y)$p.value); Rprof(NULL)
> summaryRprof()
$by.self
                 self.time self.pct total.time total.pct
cor.test.default      4.36    29.54      13.56     91.87
# ... snip ...



回答4:


You can use mapply, but as the other answers state its unlikely to be much faster as most of the time is being used up by cor.test.

matrix(mapply(function(x,y) cor.test(df[,x],df[,y])$p.value,rep(1:3,3),sort(rep(1:3,3))),nrow=3,ncol=3)

You could reduce the amount of work mapply does by using the symmetry assumption and noting the zero diagonal, eg

v <- mapply(function(x,y) cor.test(df[,x],df[,y])$p.value,rep(1:2,2:1),rev(rep(3:2,2:1)))
m <- matrix(0,nrow=3,ncol=3)
m[lower.tri(m)] <- v
m[upper.tri(m)] <- v


来源:https://stackoverflow.com/questions/5233308/is-there-a-r-function-that-applies-a-function-to-each-pair-of-columns

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!