Functions for creating and reshaping big data in R using the FF package

那年仲夏 提交于 2019-12-01 11:01:30

The function reshape does not explicitly exists for ffdf objects. But it is quite straightforward to execute with functionality from package ffbase. Just use ffdfdply from package ffbase, split by Subject and apply reshape inside the function.

An example on the Indometh dataset with 1000000 subjects.

require(ffbase)
require(datasets)
data(Indometh)

## Generate some random data
x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
x$conc <- ffrandom(n=nrow(x), rfun = rnorm)
dim(x)
[1] 11000000        3

## and reshape to wide format
result <- ffdfdply(x=x, split=x$Subject, FUN=function(datawithseveralsplitelements){
  df <- reshape(datawithseveralsplitelements, 
              v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
  as.data.frame(df)
})
class(result)
[1] "ffdf"
colnames(result)
[1] "Subject"   "conc.0.25" "conc.0.5"  "conc.0.75" "conc.1"    "conc.1.25" "conc.2"    "conc.3"    "conc.4"    "conc.5"    "conc.6"    "conc.8"   
dim(result)
[1] 1000000      12

You would be hard put to construct a less efficient method than what you offer. Using rbind.data.frame is incredibly inefficient. Try this instead to create a six thousand line dataset for 6 subjects:

DF <- data.frame( Subj = rep( 1:6, each=1000), matrix(runif(6000*11), nrow=6000) )

Scaling it up to have a billion items (US billion, not UK billion) should give you about an 10GB object, so maybe trying 80 million lines or so?

I think asking for a tutorial in the ff-package is out-of-scope for SO. Please read the FAQ. Such questions are generally closed because the questioner demonstrates that they don't really know what they are talking about.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!