I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to a data.frame and a
I dislike rewriting code... Of course, ifelse and lapply are better options, but sometimes it is difficult to make them fit.
I frequently use data.frames the way one would use lists, e.g. df$var[i].
Here is a made-up example:
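nrow=function(x){ ## required, as I call nrow() on plain lists at times
  ## is.list() is also TRUE for data.frames, so exclude them explicitly
  if(is.list(x) && !is.data.frame(x)) {
    length(x[[1]]) ## length of the first column vector
  }else{
    base::nrow(x)
  }
}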
system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5 # recycled to 1,2,3,4,5,1,2,... over all rows
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i] # remember r at the start of each group of 5
    d$foo[i]=mark
  }
})
system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5 # recycled to 1,2,3,4,5,1,2,... over all rows
  d=as.list(d) # become a list
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  d=as.data.frame(d) # revert back to data.frame
})
data.frame version:
user system elapsed
0.53 0.00 0.53
list version:
user system elapsed
0.04 0.00 0.03
Using a list of vectors is roughly 17 times faster than using a data.frame (0.53 s vs. 0.03 s elapsed).
Any comments on why data.frames are internally so slow in this regard? One would think they operate like lists...
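To make the question more concrete, here is a minimal sketch of what I believe happens on each iteration. As far as I understand R's replacement semantics, d$foo[i] = mark is evaluated roughly as a nested call to `[<-` and `$<-`, and for a data.frame the outer `$<-` dispatches to an ordinary R-level method, whereas a plain list uses the primitive directly. The small data.frame and the names d1 and d2 are made up for this illustration, and I am only assuming (not claiming) that this dispatch is the main cost.

```r
# Small data.frame just for illustration (sizes and names are made up).
d <- data.frame(seq = rep(1:5, 2), r = rnorm(10))
d$foo <- d$r

# The data.frame method for `$<-` is a regular R function in base.
has_method <- is.function(getS3method("$<-", "data.frame"))

# The usual nested replacement form ...
d1 <- d
d1$foo[3] <- 0

# ... and (roughly) what it expands to: the inner `[<-` modifies the
# extracted column, then `$<-` writes the whole column back into d2.
d2 <- d
d2 <- `$<-`(d2, "foo", `[<-`(d2$foo, 3, 0))

same_result <- identical(d1, d2)
```

If that reading is right, the loop calls the data.frame method once per row, which would fit the timings above.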
For even faster code, use class(d)='list' instead of d=as.list(d), and class(d)='data.frame' instead of d=as.data.frame(d):
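system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5 # recycled to 1,2,3,4,5,1,2,... over all rows
  class(d)='list' # treat as a plain list; no conversion via as.list()
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  class(d)='data.frame' # restore the data.frame class
})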
head(d)
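A related variant that avoids touching the class at all, sketched under the assumption that only a few columns are needed inside the loop: copy those columns into plain vectors, loop over the vectors, and assign the result back to the data.frame once. The names s, r, foo and n are made up for this sketch, and set.seed is there only to make it reproducible.

```r
set.seed(1)                       # only so the sketch is reproducible
n <- 10000
d <- data.frame(seq = rep(1:5, length.out = n), r = rnorm(n))

s    <- d$seq                     # plain integer vector
r    <- d$r                       # plain numeric vector
foo  <- r                         # result vector, same length as r
mark <- NA
for (i in seq_len(n)) {
  if (s[i] == 1) mark <- r[i]     # remember r at the start of each group
  foo[i] <- mark                  # plain vector assignment, no data.frame method
}
d$foo <- foo                      # single data.frame assignment at the end
```

This keeps the loop body unchanged apart from the variable names, so it may be easier to retrofit into existing code than a fully vectorized rewrite.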