I am using by
to apply a function to a range columns of a data frame based on a factor. Everything works perfectly well if I use mean()
You are using a split-apply strategy when you use by
. The objects being passed to the function are dataframes and you are getting the warning and error because of the non-existence of median.data.frame
and the impending non-existence of mean.data.frame
. It might work better if you used aggregate
:
> aggregate(iris[,1:3], iris["Species"], function(x) mean(x,na.rm=T))
Species Sepal.Length Sepal.Width Petal.Length
1 setosa 5.006 3.428 1.462
2 versicolor 5.936 2.770 4.260
3 virginica 6.588 2.974 5.552
> aggregate(iris[,1:3], iris["Species"], function(x) median(x,na.rm=T))
Species Sepal.Length Sepal.Width Petal.Length
1 setosa 5.0 3.4 1.50
2 versicolor 5.9 2.8 4.35
3 virginica 6.5 3.0 5.55
aggregate
works on the column vectors individually and then tabulates the results.
The original question is answered. If, however, the range happens to be (instead) all columns except those specified as the independent variable in the formula, the dot formula notation works, and represents a nifty alternative:
> aggregate(. ~ Species, data = iris, mean)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
> aggregate(. ~ Species, data = iris, median)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.0 3.4 1.50 0.2
2 versicolor 5.9 2.8 4.35 1.3
3 virginica 6.5 3.0 5.55 2.0