Question
I wanted to sum individual columns by group and my first thought was to use tapply.
However, I cannot get tapply to work. Can tapply be used to sum multiple columns?
If not, why not?
I have searched the internet extensively and found numerous similar questions posted as far back as 2008. However, none of those questions have been answered directly. Instead, the responses invariably suggest using a different function.
Below is an example data set for which I wish to sum apples by state, cherries by state
and plums by state. Below that I have compiled numerous alternatives to tapply that
do work.
At the bottom I show a simple modification to the tapply source code that allows
tapply to perform the desired operation.
Nevertheless, perhaps I am overlooking a simple way to perform the desired operation
with tapply. I am not looking for alternative functions, although additional alternatives are welcome.
Given the simplicity of my modification to the tapply source code I wonder why it, or
something similar, has not already been implemented.
Thank you for any advice. If my question is a duplicate I will be happy to post my question as an answer to that other question.
Here is the example data set:
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)
This does not work:
tapply(df.1, df.1$state, function(x) {colSums(x[,3:5])})
The help page says:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
X an atomic object, typically a vector.
I was confused by the phrase "typically a vector", which made me wonder whether
a data frame could be used. I have never been clear on what "atomic object" means.
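As a small illustration of the distinction (a sketch using only base R, with the example data repeated so it runs on its own): an atomic object is a vector whose elements all share one basic type, whereas a data frame is a list of such vectors, so its length is its number of columns.

```r
# Rebuild the example data so this snippet is self-contained.
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)

is.atomic(df.1$apples)  # TRUE: one column is an atomic (integer) vector
is.atomic(df.1)         # FALSE: a data frame is not atomic
is.list(df.1)           # TRUE: it is a list of columns
length(df.1)            # 5: the number of columns
nrow(df.1)              # 6: the number of rows
length(df.1$state)      # 6: the grouping vector has one element per row
```

The mismatch between length(df.1) and length(df.1$state) is what matters when a whole data frame is passed as X.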
Here are several alternatives to tapply that do work. The first alternative is a work-around that combines tapply with apply.
apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
with(df.1, aggregate(df.1[,3:5], data.frame(state), sum))
# state apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
t(sapply(split(df.1[,3:5], df.1$state), colSums))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum)))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
aggregate(df.1[,3:5], by=list(df.1$state), sum)
# Group.1 apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
by(df.1[,3:5], df.1$state, colSums)
# df.1$state: AA
# apples cherries plums
# 111 222 333
# ------------------------------------------------------------
# df.1$state: BB
# apples cherries plums
# -111 -222 -333
with(df.1,
     aggregate(x = list(apples = apples,
                        cherries = cherries,
                        plums = plums),
               by = list(state = state),
               FUN = function(x) sum(x)))
# state apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} )
# $AA
# apples cherries plums
# 111 222 333
#
# $BB
# apples cherries plums
# -111 -222 -333
Here is the source code for tapply except that I changed the line:
nx <- length(X)
to:
nx <- ifelse(is.vector(X), length(X), dim(X)[1])
This modified version of tapply performs the desired operation:
my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    if (!nI) stop("'INDEX' is of length zero")
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx <- ifelse(is.vector(X), length(X), dim(X)[1]) # replaces nx <- length(X)
    one <- 1L
    group <- rep.int(one, nx) #- to contain the splitting vector
    ngroup <- one
    for (i in seq_along(INDEX)) {
        index <- as.factor(INDEX[[i]])
        if (length(index) != nx)
            stop("arguments must have same length")
        namelist[[i]] <- levels(index)#- all of them, yes !
        extent[i] <- nlevels(index)
        group <- group + ngroup * (as.integer(index) - one)
        ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(X = split(X, group), FUN = FUN, ...)
    index <- as.integer(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1L)) {
        ansmat <- array(dim = extent, dimnames = namelist)
        ans <- unlist(ans, recursive = FALSE)
    } else {
        ansmat <- array(vector("list", prod(extent)),
                        dim = extent, dimnames = namelist)
    }
    if(length(index)) {
        names(ans) <- NULL
        ansmat[index] <- ans
    }
    ansmat
}
my.tapply(df.1$apples, df.1$state, function(x) {sum(x)})
# AA BB
# 111 -111
my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)})
# $AA
# apples cherries
# 111 222
#
# $BB
# apples cherries
# -111 -222
Answer 1:
tapply works on a vector. For a data.frame you can use by, which is a wrapper for tapply (take a look at its code):
> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
apples cherries plums
111 222 333
-------------------------------------------------------------------------------------
df.1$state: BB
apples cherries plums
-111 -222 -333
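To make the "wrapper" remark concrete, here is a minimal sketch of the same idea written directly with tapply: X is the vector of row numbers rather than the data frame itself, and the function uses each group's row numbers to subset the data frame. The name sums.by.state is only illustrative.

```r
# Rebuild the example data so this snippet is self-contained.
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)

# Group the row indices (an atomic vector) by state, then sum the
# fruit columns over the rows that belong to each group.
sums.by.state <- tapply(seq_len(nrow(df.1)), df.1$state,
                        function(i) colSums(df.1[i, 3:5]))

sums.by.state[["AA"]]  # named vector of column sums for state AA
sums.by.state[["BB"]]  # named vector of column sums for state BB
```

Because each group returns three values rather than one, tapply does not simplify the result to a plain vector; it returns a list-like array with one element per state.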
Answer 2:
You're looking for by. It uses the INDEX in the way that you assumed tapply would, by row.
by(df.1, df.1$state, function(x) colSums(x[,3:5]))
The problem with your use of tapply is that a data.frame is really just a list of columns, so tapply measured df.1 by column: its length is 5 (the number of columns), while the index df.1$state has 6 elements (one per row). tapply therefore complained that the arguments did not have the same length.
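If a single table is preferred over the printed "by" object, one common idiom is to bind the per-group results together with do.call and rbind. This is a sketch on the same example data; the names sums.by.state and sums.matrix are only illustrative.

```r
# Rebuild the example data so this snippet is self-contained.
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)

# by() splits the data frame by row according to state and applies
# colSums to each piece.
sums.by.state <- by(df.1[, 3:5], df.1$state, colSums)

# Stack the per-state vectors into a matrix: one row per state,
# one column per fruit.
sums.matrix <- do.call(rbind, sums.by.state)
sums.matrix
```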
Answer 3:
I looked at the source code for by, as EDi suggested. That code was substantially more complex than my change to the one line in tapply. I have now found that my.tapply does not work with the more complex scenario below where apples and cherries are summed by state and county. If I get my.tapply to work with this case I can post the code here later:
df.2 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 1 1 2 3
AA 2 10 20 30
AA 2 10 20 30
AA 3 100 200 300
AA 3 100 200 300
BB 7 -1 -2 -3
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 8 -10 -20 -30
BB 9 -100 -200 -300
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)
# my function works
tapply(df.2$apples , list(df.2$state, df.2$county), function(x) {sum(x)})
my.tapply(df.2$apples , list(df.2$state, df.2$county), function(x) {sum(x)})
# my function works
tapply(df.2$cherries, list(df.2$state, df.2$county), function(x) {sum(x)})
my.tapply(df.2$cherries, list(df.2$state, df.2$county), function(x) {sum(x)})
# my function does not work
my.tapply(df.2[,3:4], list(df.2$state, df.2$county), function(x) {colSums(x)})
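For comparison, this two-factor case can be handled with the formula interface of aggregate. This is an alternative to my.tapply rather than a fix for it, sketched here on the same df.2 data; the name sums.2 is only illustrative.

```r
# Rebuild the second example data set so this snippet is self-contained.
df.2 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 1 1 2 3
AA 2 10 20 30
AA 2 10 20 30
AA 3 100 200 300
AA 3 100 200 300
BB 7 -1 -2 -3
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 8 -10 -20 -30
BB 9 -100 -200 -300
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)

# Sum apples and cherries for every state/county combination present
# in the data; cbind() on the left-hand side names the columns to sum.
sums.2 <- aggregate(cbind(apples, cherries) ~ state + county,
                    data = df.2, FUN = sum)
sums.2
```

The result is a data frame with one row per observed state/county pair, so combinations that never occur (such as AA with county 7) do not appear.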
Source: https://stackoverflow.com/questions/17903205/sum-multiple-columns-by-group-with-tapply