Apply a function over groups of columns


How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data

6 answers

無奈伤痛

2020-11-28 12:34

A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.

# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
    # Create index list
    if (length(by) == 1)
    {
        nc <- ncol(x)
        split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
    } else # 'by' is a vector of groups
    {
        nc <- length(by)
        split.index <- by
    }
    index.list <- split(seq(from = 1, to = nc), split.index)

    # Pass index list to fun using sapply() and return object
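    # drop = FALSE keeps a one-column group as a data frame, so functions
    # such as rowMeans() still work when the last group has a single column
    sapply(index.list, function(i)
            {
                do.call(fun, list(x[, i, drop = FALSE], ...))
            })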
}

Then, to find the mean of the replicates:

byapply(dat, 3, rowMeans)

Or, perhaps the standard deviation of the replicates:

byapply(dat, 3, apply, 1, sd)

Update

by can also be specified as a vector of groups:

byapply(dat, c(1,1,1,2,2,2), rowMeans)
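
To make the grouping logic easier to check by hand, here is a minimal sketch of what `byapply(dat, 3, rowMeans)` does internally, written out step by step. The toy data frame `dat_small` and the names `grp`, `idx` and `res` are made up for illustration and are not part of the original answer.

```r
# Toy data: two groups of three replicate columns (values chosen for easy checking)
dat_small <- data.frame(a = c(1, 4), b = c(2, 5), c = c(3, 6),
                        d = c(10, 40), e = c(20, 50), f = c(30, 60))

# Step 1: label each column with its group number (1,1,1,2,2,2)
grp <- rep(seq_len(ceiling(ncol(dat_small) / 3)), each = 3,
           length.out = ncol(dat_small))

# Step 2: turn the labels into a list of column indices, one element per group
idx <- split(seq_len(ncol(dat_small)), grp)

# Step 3: apply the function to each group of columns and bind the results
res <- sapply(idx, function(i) rowMeans(dat_small[, i, drop = FALSE]))
res
```

Each column of `res` holds the row means for one group of three columns.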


陌清茗

2020-11-28 12:39
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
```
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
```
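
As a quick sanity check, the following sketch compares the `apply` version with the `rowMeans` version on a small simulated data frame. The six-column `dat` and the names `via_apply` and `via_rowmeans` are assumptions made for this example only.

```r
set.seed(1)
# Assumed example data: 4 rows, 6 columns (two groups of three)
dat <- data.frame(matrix(rnorm(4 * 6), ncol = 6))

# apply() walks over rows; each row yields two group means,
# so the result has to be transposed back to one row per observation
via_apply <- t(apply(dat, 1, function(x) c(mean(x[1:3]), mean(x[4:6]))))

# rowMeans() computes the same values in a vectorized way
via_rowmeans <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))

# Compare the values only, ignoring dimnames
same <- isTRUE(all.equal(unname(via_apply), unname(via_rowmeans)))
same
```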
故里飘歌

2020-11-28 12:39
Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):

First, make a data frame of example data with p columns to simulate a realistic data set (following @TylerRinker's answer above, and unlike my poor example in the question):
```
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
```
Rename the columns in this data frame to mark groups of n consecutive columns. For groups of three columns the names become 1,1,1,2,2,2,3,3,3, etc.; for groups of four they would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing):
```
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq_len(ncol(dat) / n), each = n, length.out = ncol(dat))
```
Now use apply and tapply to get row means for each of the groups
```
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
```
The main downsides are that the column names in the original data are replaced (though this could be overcome by keeping the grouping numbers in a separate vector rather than in the column names) and that the apply-tapply call returns the columns in an unhelpful order: the group names are sorted as character strings, so the result runs 1, 10, 11, ... rather than 1, 2, 3, ...
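
Both downsides can be avoided with a small variation, sketched here under the assumption that the groups are consecutive blocks of `n` columns. The grouping is kept in a separate factor (called `grp` here, a name chosen for this example) whose levels are in numeric order, so the original column names stay untouched.

```r
set.seed(42)
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4 * p), ncol = p))
n <- 3  # how many consecutive columns per group?

# A factor built from integers has its levels in numeric order (1, 2, 3, ...),
# so tapply() returns the groups in that order
grp <- factor(rep(seq_len(ceiling(p / n)), each = n, length.out = p))

# Same apply/tapply idea as above, but grouping by 'grp' instead of names(dat)
dat.avs <- data.frame(t(apply(dat, 1, tapply, grp, mean)))
dim(dat.avs)
```

The second column of `dat.avs` then corresponds to the second group, that is, original columns 4 to 6.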

Further to @joran's suggestion, here's a data.table solution:
```
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <-  data.frame(t(dat))

n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq_len(ncol(dat) / n), each = n, length.out = ncol(dat)))

library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
```
Thanks everyone for your quick and patient efforts!

离开以前

2020-11-28 12:43

This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:

x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

Works if you just have col names too:

x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
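
A small variation that may be convenient: if the list of indices is named, `cbind` uses those names as column names of the result. The sketch below assumes a toy data frame `dat_small` and group names `abc` and `def`, all invented for this example.

```r
# Toy data with easily checked values
dat_small <- data.frame(a = 1:2, b = 3:4, c = 5:6,
                        d = 7:8, e = 9:10, f = 11:12)

# Named list of column groups; the names label the output columns
groups <- list(abc = c("a", "b", "c"), def = c("d", "e", "f"))

# drop = FALSE keeps single-column groups as data frames for rowMeans()
res <- do.call(cbind, lapply(groups, function(i) {
  rowMeans(dat_small[, i, drop = FALSE])
}))
res
```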

EDIT

It just occurred to me that you may want to automate this for every three columns. I know there's a better way, but here it is on a 100-column data set (the leftover 100th column, which does not fill a complete group of three, is dropped):

dat <- data.frame(matrix(rnorm(16*100), ncol=100))

n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))

EDIT 2: I'm still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though still not satisfying, method:

n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=FALSE, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]

do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
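
One possibly simpler way to build the index list, offered as a sketch rather than as the answer's own method, is to `split` the column numbers by a repeated group label. Like the two versions above, it only uses complete groups, so the 100th column is left out. The names `k` and `n_full` are introduced here for illustration.

```r
set.seed(1)
dat <- data.frame(matrix(rnorm(16 * 100), ncol = 100))

k <- 3                                # group size
n_full <- (ncol(dat) %/% k) * k       # 99: columns that fit into complete groups

# List of 33 index vectors: 1:3, 4:6, ..., 97:99
ind <- split(seq_len(n_full), rep(seq_len(n_full / k), each = k))

res <- do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i, drop = FALSE])))
dim(res)
```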


悲&欢浪女

2020-11-28 12:49

Row means of columns a, b, c:

 rowMeans(dat[1:3])

Row means of columns d, e, f:

 rowMeans(dat[4:6])

Combined in one call:

results<-cbind(rowMeans(dat[1:3]),rowMeans(dat[4:6]))

If you only know the names of the columns and not their positions, you can use:

rowMeans(cbind(dat["a"],dat["b"],dat["c"]))
rowMeans(cbind(dat["d"],dat["e"],dat["f"]))

# I don't know how much this costs in speed, but it should still be quick


被撕碎了的回忆

2020-11-28 12:54
There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, a problem from what is known as combinatorics (here df is the data frame).
```
combinations <- combn(colnames(df),2,function(x) rowMeans(df[x]))
```
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. Note that combn still calls the function once per combination, so it is not inherently faster than the apply family functions used above; the vectorized part is rowMeans itself. If the order of the columns matters, you need permutations rather than combinations: combinat::permn generates permutations, though it permutes the whole set rather than choosing subsets of a given size.
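
A minimal sketch of the `combn` approach on a toy data frame, so the output can be checked by hand. The data frame `dat_small` and the column labels built with `paste` are assumptions added for this example; by default the result of `combn` has no column names.

```r
# Toy data: three columns give three unique pairs (a-b, a-c, b-c)
dat_small <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6))

# One column of row means per pair of columns
pair_means <- combn(colnames(dat_small), 2,
                    function(x) rowMeans(dat_small[x]))

# Label each result column with the pair it came from
colnames(pair_means) <- combn(colnames(dat_small), 2, paste, collapse = "-")
pair_means
```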