I've got a dataframe dat of size 30000 x 50. I also have a separate list whose elements point to groupings of rows from this dataframe, e.g.,
rows <- list(c(
Here's one attempt at a speedup - it hinges on the fact that it is faster to look up a row index than to look up a row name, and so tries to make a mapping of rowname to rownumber in dat.
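As a rough way to see the difference between the two kinds of lookup for yourself, here is a small sketch that fetches the same rows once by name and once by position. The data frame `small`, the sizes, and the repeat count are arbitrary stand-ins I made up so it runs quickly; timings will vary by machine, so none are quoted here.

```r
# Small stand-in data frame (sizes are arbitrary, chosen to run quickly)
set.seed(1)
small <- data.frame(matrix(runif(2000 * 5), ncol = 5))
rownames(small) <- as.character(sample.int(nrow(small)))

pos <- sample.int(nrow(small), 20)   # row numbers
nms <- rownames(small)[pos]          # the row names at those positions

# Same rows, requested by name and then by position
t_name  <- system.time(for (i in 1:200) by_name  <- small[nms, ])
t_index <- system.time(for (i in 1:200) by_index <- small[pos, ])

# Compare the two timings, then confirm both lookups returned the same rows
print(rbind(by_name = t_name, by_index = t_index))
print(identical(rownames(by_name), rownames(by_index)))
```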
First create some data of the same size as yours and assign some numeric rownames:
> dat <- data.frame(matrix(runif(30000*50),ncol=50))
> rownames(dat) <- as.character(sample.int(nrow(dat)))
> rownames(dat)[1:5]
[1] "21889" "3050" "22570" "28140" "9576"
Now generate a random rows list with 15000 elements, each holding between 1 and 50 random numbers from 1 to 30000 (these are row *names* in this case, stored as character strings):
# 15000 groups of up to 50 rows each
> rows <- lapply(1:15000, function(i) as.character(sample.int(30000,size=sample.int(50,size=1))))
For timing purposes, try the method in your question (ouch!):
# method 1
> system.time((res1 <- lapply(rows,function(r) dat[r,])))
user system elapsed
182.306 0.877 188.362
Now, try to make a mapping from row name to row number. map[i] should give the row number of the row whose name is i.
FIRST, if your row names are a permutation of 1:nrow(dat), you're in luck! All you have to do is sort the rownames and keep the indices that the sort returns (this is the same permutation that order() gives):
> map <- sort(as.numeric(rownames(dat)), index.return=TRUE)$ix
# NOTE: map[ as.numeric(rowname) ] -> rownumber into dat for that rowname.
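To make the mapping concrete, here is a toy sketch with five rows. The data frame `toy`, its values, and the name `toy_map` are made up for illustration; it uses `order()`, which returns the same permutation as `sort(..., index.return=TRUE)$ix`.

```r
# Toy data frame whose row names are a shuffled permutation of 1:5
toy <- data.frame(x = c(10, 20, 30, 40, 50))
rownames(toy) <- c("3", "1", "5", "2", "4")

# toy_map[k] is the position of the row named "k"
toy_map <- order(as.numeric(rownames(toy)))

wanted   <- c("5", "2")
by_name  <- toy[wanted, , drop = FALSE]                       # lookup by row name
by_index <- toy[toy_map[as.numeric(wanted)], , drop = FALSE]  # lookup by row number

print(toy_map)
print(by_index)
```

Here the row named "5" sits in position 3 and the row named "2" in position 4, so both lookups return the rows holding 30 and 40.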
Now look up row indices instead of row names:
> system.time((res2 <- lapply(rows,function(r) dat[map[as.numeric(r)],])))
user system elapsed
32.424 0.060 33.050
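Check we didn't screw anything up. res1 and res2 are lists of data frames, so compare them element by element (calling rownames() on the lists themselves returns NULL, which would make the test trivially TRUE). It is sufficient to match the rownames since rownames are unique in R:
> all(mapply(function(a, b) identical(rownames(a), rownames(b)), res1, res2))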
[1] TRUE
So, a ~6x speedup. Still not amazing though...
SECOND, if you're unlucky and your rownames are not a permutation of 1:nrow(dat), you could try this, but only if the rownames are still positive integers and max(as.numeric(rownames(dat))) is not too much bigger than nrow(dat). It builds map so that map[rowname] gives the row number, but since the rownames are no longer necessarily contiguous, map can contain heaps of gaps, which wastes a bit of memory:
# one slot per possible rowname; -1 marks names that do not occur in dat
map <- rep(-1,max(as.numeric(rownames(dat))))
# obj$x holds the sorted rownames, obj$ix their positions in dat
obj <- sort(as.numeric(rownames(dat)), index.return=TRUE)
map[obj$x] <- obj$ix
Then use map as before (dat[map[as.numeric(r)],]).
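Here is a toy sketch of the gappy map, with made-up row names 7, 2 and 12. Two things in it are my own choices rather than part of the method above: it fills the map directly with `seq_along()` (equivalent to the sort-based assignment) and it marks gaps with `NA` instead of -1. It also cross-checks the result against base R's `match()`, which returns the same row numbers and works for non-numeric row names too.

```r
# Toy data frame with non-contiguous numeric row names (hypothetical values)
toy <- data.frame(x = c(10, 20, 30))
rownames(toy) <- c("7", "2", "12")

ids <- as.numeric(rownames(toy))

# One slot per possible name up to max(ids); unused slots stay NA
gap_map <- rep(NA_integer_, max(ids))
gap_map[ids] <- seq_along(ids)

wanted <- c("12", "7")
idx <- gap_map[as.numeric(wanted)]   # row numbers for the wanted names
res <- toy[idx, , drop = FALSE]

# Cross-check: match() finds the same row numbers without building a map
idx_match <- match(wanted, rownames(toy))

print(idx)
print(res)
```

Only 3 of the 12 slots in `gap_map` are used, which is the memory trade-off described above.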