Apply a function over groups of columns

梦谈多话 2020-11-28 12:18

How can I use apply or a related function to create a new data frame that contains the row averages of each pair of columns in a very large data frame?

6 Answers
  • 2020-11-28 12:34

    A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.

    # Function to apply 'fun' to object 'x' over every 'by' columns
    # Alternatively, 'by' may be a vector of groups
    byapply <- function(x, by, fun, ...)
    {
        # Create index list
        if (length(by) == 1)
        {
            nc <- ncol(x)
            split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
        } else # 'by' is a vector of groups
        {
            nc <- length(by)
            split.index <- by
        }
        index.list <- split(seq(from = 1, to = nc), split.index)
    
        # Pass index list to fun using sapply() and return object
        sapply(index.list, function(i)
                {
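                    # drop = FALSE keeps a one-column group two-dimensional,
                    # so functions such as rowMeans() still work on it
                    do.call(fun, list(x[, i, drop = FALSE], ...))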
                })
    }
    

    Then, to find the mean of the replicates:

    byapply(dat, 3, rowMeans)
    

    Or, perhaps the standard deviation of the replicates:

    byapply(dat, 3, apply, 1, sd)
    

    Update

    by can also be specified as a vector of groups:

    byapply(dat, c(1,1,1,2,2,2), rowMeans)
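
    To see what the index list inside byapply looks like, here is a minimal sketch that repeats just the index-building step on a hypothetical 7-column case (the values of nc and by are made up for illustration). Note how the last group ends up short when the column count is not a multiple of by.

    ```r
    # Hypothetical sizes, chosen only for illustration
    nc <- 7   # number of columns
    by <- 3   # group size

    # Same expression byapply uses when 'by' is a single number
    split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)

    # One vector of column positions per group: 1:3, 4:6, and a short group 7
    index.list <- split(seq_len(nc), split.index)
    print(index.list)
    ```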
    
  • 2020-11-28 12:39

    The rowMeans solution will be faster, but for completeness here's how you might do this with apply:

    t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
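
    As a quick sanity check, this sketch runs the apply version and the rowMeans version on a small made-up data frame (the column names a to f and the values are assumptions for illustration) and confirms that they agree.

    ```r
    # Small made-up data frame with six columns
    dat <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 2:5, e = 6:9, f = 10:13)

    # Row-wise apply: one pair of means per row, transposed back to rows
    via.apply <- t(apply(dat, 1, function(x) c(mean(x[1:3]), mean(x[4:6]))))

    # The rowMeans equivalent
    via.rowMeans <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))

    # Compare values only, ignoring row/column names
    same <- isTRUE(all.equal(unname(via.apply), unname(via.rowMeans)))
    print(same)
    ```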
    
  • 2020-11-28 12:39

    Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):

    Make a data frame of example data with p cols to simulate a realistic data set (following @TylerRinker's answer above and unlike my poor example in the question)

    p <- 99 # how many columns?
    dat <- data.frame(matrix(rnorm(4*p), ncol = p))
    

    Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in the groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing)

    n <- 3 # how many consecutive columns in the groups of interest?
    # ceiling() keeps this valid when ncol(dat) is not a multiple of n
    names(dat) <- rep(seq_len(ceiling(ncol(dat) / n)), each = n, length.out = ncol(dat))
    

    Now use apply and tapply to get row means for each of the groups

    dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
    

    The main downsides are that the column names in the original data are replaced (though this could be overcome by keeping the grouping numbers somewhere other than the colnames) and that the apply-tapply call returns the columns in an unhelpful order: the group labels are treated as character strings, so they sort as 1, 10, 11, ..., 2, 20, ... rather than numerically.
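
    A possible way around both downsides, sketched here on a smaller made-up data set (p = 30 is an assumption for illustration): keep the grouping in a separate factor instead of the column names. A factor built from numbers has its levels in numeric order, so the result columns come back as 1, 2, 3, ... and the original column names are left alone.

    ```r
    set.seed(1)
    p <- 30   # how many columns? (smaller than above, for illustration)
    n <- 3    # how many consecutive columns per group?
    dat <- data.frame(matrix(rnorm(4 * p), ncol = p))

    # Grouping kept outside the data frame; levels are in numeric order
    groups <- factor(rep(seq_len(ceiling(p / n)), each = n, length.out = p))

    # Same apply + tapply idea, but indexed by 'groups' rather than names(dat)
    dat.avs <- t(apply(dat, 1, tapply, groups, mean))
    print(colnames(dat.avs))
    ```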

    Further to @joran's suggestion, here's a data.table solution:

    p <- 99 # how many columns?
    dat <- data.frame(matrix(rnorm(4*p), ncol = p))
    dat.t <-  data.frame(t(dat))
    
    n <- 3 # how many consecutive columns in the groups of interest?
    dat.t$groups <- as.character(rep(seq_len(ceiling(ncol(dat) / n)), each = n, length.out = ncol(dat)))
    
    library(data.table)
    DT <- data.table(dat.t)
    setkey(DT, groups)
    dat.av <- DT[, lapply(.SD,mean), by=groups]
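
    For comparison, the same transpose-then-group idea can be sketched in base R with rowsum(), which sums the rows of a matrix by group. This is a different technique from the data.table call above, not a reproduction of it, and it assumes the groups are consecutive blocks of columns.

    ```r
    set.seed(42)
    p <- 99   # how many columns?
    n <- 3    # how many consecutive columns per group?
    dat <- data.frame(matrix(rnorm(4 * p), ncol = p))

    groups <- rep(seq_len(p / n), each = n)

    # Transpose so original columns become rows, then sum rows within each group
    sums <- rowsum(t(as.matrix(dat)), group = groups)   # 33 x 4

    # Divide each group's sums by its size, then transpose back to 4 x 33
    counts <- as.vector(table(groups))
    dat.av <- t(sums / counts)
    print(dim(dat.av))
    ```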
    

    Thanks everyone for your quick and patient efforts!

  • 2020-11-28 12:43

    This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:

    x <- list(1:3, 4:6)
    do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
    

    Works if you just have col names too:

    x <- list(c('a','b','c'), c('d', 'e', 'f'))
    do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
    

    EDIT

    Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:

    dat <- data.frame(matrix(rnorm(16*100), ncol=100))
    
    n <- 1:ncol(dat)
    ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
    ind <- data.frame(t(na.omit(ind)))
    do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
    

    EDIT 2: Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though still unsatisfying, method:

    n <- 1:ncol(dat)
    ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=FALSE, nrow=3))
    nonna <- sapply(ind, function(x) all(!is.na(x)))
    ind <- ind[, nonna]
    
    do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
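
    One possible simplification of the indexing, offered as a sketch rather than a benchmarked improvement: build the groups with split() and ceiling(), then drop any incomplete trailing group, which matches what the NA padding above does. The names k and cols are introduced here for illustration.

    ```r
    set.seed(1)
    dat <- data.frame(matrix(rnorm(16 * 100), ncol = 100))

    k <- 3                                   # group size
    cols <- seq_len(ncol(dat))
    ind <- split(cols, ceiling(cols / k))    # 1:3, 4:6, ..., 97:99, 100
    ind <- ind[lengths(ind) == k]            # drop the incomplete last group

    res <- do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
    print(dim(res))
    ```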
    
  • 2020-11-28 12:49

    Row means over columns a, b, c:

     rowMeans(dat[1:3])
    

    Row means over columns d, e, f:

     rowMeans(dat[4:6])
    

    Both in one call:

    results<-cbind(rowMeans(dat[1:3]),rowMeans(dat[4:6]))
    

    If you only know the names of the columns and not their positions, you can use:

    rowMeans(cbind(dat["a"],dat["b"],dat["c"]))
    rowMeans(cbind(dat["d"],dat["e"],dat["f"]))
    
    # I don't know how much this costs in speed, but it should still be quick
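
    A shorter way to select by name is to index with a character vector directly. The sketch below uses a small made-up data frame (values are assumptions for illustration) and checks that it gives the same result as the cbind form.

    ```r
    # Small made-up data frame with columns a to f
    dat <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 1:3, e = 2:4, f = 3:5)

    # Index by a vector of names instead of cbind-ing one column at a time
    by.name  <- rowMeans(dat[c("a", "b", "c")])
    by.cbind <- rowMeans(cbind(dat["a"], dat["b"], dat["c"]))

    same <- isTRUE(all.equal(by.name, by.cbind))
    print(same)
    ```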
    
  • 2020-11-28 12:54

    There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, in what is known as combinatorics.

    combinations <- combn(colnames(df),2,function(x) rowMeans(df[x]))
    

    To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. rowMeans is vectorized within each combination, although combn itself still calls the function once per combination, and the number of combinations grows quickly with the number of columns. If the order of the columns matters, then you instead need a permutation function designed to produce ordered sets, such as combinat::permn.
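
    A small sketch of what the combn call returns, using a made-up three-column data frame named dat here (the answer above calls it df). The result is a matrix with one column per combination and no column names, so this example also labels the columns; the "a_b" naming scheme is an assumption for illustration.

    ```r
    # Small made-up data frame
    dat <- data.frame(a = 1:2, b = 3:4, c = 5:6)

    # Row means for every unordered pair of columns: one result column per pair
    means <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))

    # Label each result column with the pair that produced it
    pairs <- combn(colnames(dat), 2)
    colnames(means) <- apply(pairs, 2, paste, collapse = "_")
    print(means)
    ```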
