I've got a data frame dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this data frame, e.g.,
rows <- list(c("34", "36", ...), ...)
My original post started with this erroneous statement:
The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, e.g. you are hunting through each row to see which one is named "36", then starting from the beginning to do it again for "34".
Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.
Note that the suggestions in this answer assume that you have non-overlapping subsets of data.
If you want to keep your list-lookup strategy, I'd suggest storing the actual row indices instead of the string names.
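For example, a minimal sketch of that conversion, assuming rows holds character row names as in your question:

## Translate the character row names into integer positions once ...
rows.idx <- lapply(rows, function(r) match(r, rownames(dat)))

## ... then subset by position, so no repeated name lookups are needed
dat[rows.idx[[1]], ]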
An alternative is to store your "group" information as another column of your data.frame, then split your data.frame on its group, e.g. let's say your recoded data.frame looks like this:
dat <- data.frame(a=sample(100, 10),
                  b=rnorm(10),
                  group=sample(c('a', 'b', 'c'), 10, replace=TRUE))
You could then do:
split(dat, dat$group)
$a
   a           b group
2 66 -0.08721261     a
9 62 -1.34114792     a

$b
    a          b group
1  32  0.9719442     b
5  79 -1.0204179     b
6  83 -1.7645829     b
7  73  0.4261097     b
10 44 -0.1160913     b

$c
   a          b group
3 77  0.2313654     c
4 74 -0.8637770     c
8 29  1.0046095     c
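If all you need is to run something over each of those pieces, you can lapply over the result of split; for instance, a sketch that takes the per-group means of the two numeric columns from the toy data above:

## Apply a summary function to every group in the list that split() returns
lapply(split(dat, dat$group), function(d) colMeans(d[, c("a", "b")]))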
Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:
library(data.table)
dat <- data.table(dat, key="group")
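With the key set, a single group can be pulled out via a keyed join, e.g. for the group "a" from the toy data:

## Keyed lookup: returns all rows whose key column equals "a"
dat[J("a")]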
Now do your list thing -- which will give you the same result as the split above:
x <- lapply(unique(dat$group), function(g) dat[J(g),])
But you probably want to "work over your splits", and you can do that inline, e.g.:
ans <- dat[, {
    ## do some code over the data in each split
    ## and return a list of results, e.g.:
    list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
}, by="group"]
ans
     group nrow mean.a     mean.b
[1,]     a    2   64.0 -0.7141803
[2,]     b    5   62.2 -0.3006076
[3,]     c    3   60.0  0.1240660
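As an aside, on current versions of data.table the same grouped summary is usually written with the built-in row counter .N and the .() shorthand for list(), e.g.:

## .N is the number of rows in each group; .() is an alias for list()
dat[, .(nrow=.N, mean.a=mean(a), mean.b=mean(b)), by="group"]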
You can do the last step in "a similar fashion" with plyr, e.g.:
library(plyr)
ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
      mean.b=mean(b))
  group nrow mean.a     mean.b
1     a    2   64.0 -0.7141803
2     b    5   62.2 -0.3006076
3     c    3   60.0  0.1240660
But since you mention your dataset is quite large, I think you'd like the speed boost data.table will provide.
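If you want to check that on data of roughly your size before committing, here is a rough timing sketch (the 30000 x 50 shape and the three groups are placeholders meant to mimic your data, and big/big.dt are just made-up names):

## Build a data.frame of roughly the dimensions you describe
big <- as.data.frame(matrix(rnorm(30000 * 50), nrow=30000))
big$group <- sample(c('a', 'b', 'c'), 30000, replace=TRUE)
big.dt <- data.table(big, key="group")

## Time a grouped mean of the first column with plyr vs. data.table
system.time(ddply(big, "group", summarize, mean.V1=mean(V1)))
system.time(big.dt[, list(mean.V1=mean(V1)), by="group"])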