Am I using plyr right? I seem to be using way too much memory

Submitted by 浪子不回头ぞ on 2019-12-21 09:27:47

Question


I have the following, somewhat large dataset:

 > dim(dset)
 [1] 422105     25
 > class(dset)
 [1] "data.frame"
 > 

Without doing anything, the R process seems to take about 1GB of RAM.

I am trying to run the following code:

dset <- ddply(dset, .(tic), transform,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)

Running that code, RAM usage skyrockets: it completely saturated 60GB of RAM, running on a 32-core machine. What am I doing wrong?


Answer 1:


If performance is an issue, it might be a good idea to switch to data.tables, from the package of the same name. They are fast. You'd do something roughly equivalent to this:

library(data.table)
library(plyr)  # for mutate() below
dat <- data.frame(x = runif(100),
                  dt = seq.Date(as.Date('2010-01-01'),as.Date('2011-01-01'),length.out = 100),
                  grp = rep(letters[1:4],each = 25))

dt <- as.data.table(dat)
setkey(dt, grp)

dt[,mutate(.SD,date.min = min(dt),
               date.max = max(dt),
               daterange = max(dt) - min(dt)), by = grp]
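
If you'd rather not mix plyr's mutate into the data.table call, a roughly equivalent sketch (not from the original answer) uses := to add the columns by reference. Note that dt inside the call refers to the date column of dat, which happens to share a name with the table:

# Sketch: add the three columns by reference, grouped by grp
# (inside j, "dt" resolves to the date column, not the table)
dt[, c("date.min", "date.max", "daterange") :=
       list(min(dt), max(dt), max(dt) - min(dt)),
   by = grp]

Since := updates the table in place, there is no need to assign the result back.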



Answer 2:


Here's an alternative application of data.table to the problem, illustrating how blazing-fast it can be. (Note: this uses dset, the data.frame constructed by Brian Diggs in Answer 3 below, except with 30000 rather than 10 levels of tic.)
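
The dset used for that timing isn't reproduced here, but a rough sketch of it, adapted from the dummy-data construction in Answer 3 below with 30000 tic levels, would be:

# Sketch: dummy dset of the stated dimensions, with 30000 tic levels
n <- 422105
dset <- data.frame(date = as.Date("2000-01-01") + sample(3650, n, replace = TRUE),
                   tic = factor(sample(30000, n, replace = TRUE)))
for (i in 3:25) {
    dset[i] <- rnorm(n)
}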

(The reason this is much faster than @joran's solution is that it avoids the use of .SD, instead using the columns directly. The style is a bit different from plyr, but it typically buys huge speed-ups. For another example, see the data.table Wiki, which (a) includes this as recommendation #1, and (b) shows a 50X speedup for code that drops the .SD.)

library(data.table)
system.time({
    dt <- data.table(dset, key="tic")
    # Summarize by groups and store results in a summary data.table
    sumdt <- dt[ ,list(min.date=min(date), max.date=max(date)), by="tic"]
    sumdt[, daterange:= max.date-min.date]
    # Merge the summary data.table back into dt, based on key
    dt <- dt[sumdt]
})
# ELAPSED TIME IN SECONDS
# user  system elapsed 
# 1.45    0.25    1.77 



Answer 3:


A couple of things come to mind.

First, I would write it as:

dset <- ddply(dset, .(tic), summarise,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)

Well, actually, I would probably avoid double-calculating the min/max date and write:

dset <- ddply(dset, .(tic), function(DF) {
              mutate(summarise(DF, date.min = min(date),
                               date.max = max(date)),
                     daterange = date.max - date.min)},
              .parallel = TRUE)

but that's not the main point you are asking about.

With a dummy data set of your dimensions

n <- 422105
dset <- data.frame(date = as.Date("2000-01-01") + sample(3650, n, replace=TRUE),
                   tic = factor(sample(10, n, replace=TRUE)))
for (i in 3:25) {
    dset[i] <- rnorm(n)
}

this ran comfortably (under 1 minute) on my laptop. In fact, the plyr step took less time than creating the dummy data set, so it could not have been swapping on the scale you saw.
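
To compare on your own machine, one option is to wrap the plyr step in system.time(); a minimal sketch, using the summarise version from above and running serially:

library(plyr)
# Sketch: time the serial (non-parallel) ddply step on the dummy dset above
system.time(
    res <- ddply(dset, .(tic), summarise,
                 date.min = min(date),
                 date.max = max(date),
                 daterange = max(date) - min(date))
)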

A second possibility is a large number of unique values of tic, which could increase the memory needed. However, when I tried increasing the number of unique tic values to 1000, it didn't really slow down.

Finally, it could be something in the parallelization. I don't have a parallel backend registered for foreach, so it just ran serially. Perhaps that is what is causing your memory explosion.
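
For completeness, registering a parallel backend before calling ddply with .parallel = TRUE looks roughly like this (a sketch using doParallel; the question doesn't say which backend, if any, was registered):

library(plyr)
library(doParallel)

# Sketch: register a foreach backend so .parallel = TRUE actually runs in parallel.
# Each worker handles its own chunks of groups, which can multiply total memory use.
registerDoParallel(cores = 4)

dset <- ddply(dset, .(tic), summarise,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)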




Answer 4:


Are there a large number of factor levels in the data frame? I've found that this type of excessive memory usage is common in adply and possibly other plyr functions, but it can be remedied by removing unnecessary factors and levels. If the large data frame was read into R, make sure stringsAsFactors is set to FALSE in the import:

dat = read.csv(header=TRUE, sep="\t", file="dat.tsv", stringsAsFactors=FALSE)

Then assign the factors you actually need.
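
A minimal sketch of that cleanup, assuming tic is the only column that really needs to be a factor (the column names here are illustrative):

# Sketch: convert only the grouping column to a factor
dat$tic <- factor(dat$tic)

# ...and drop any factor levels no longer used anywhere in the data frame
dat <- droplevels(dat)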

I haven't looked into Hadley's source yet to discover why.



Source: https://stackoverflow.com/questions/8454019/am-i-using-plyr-right-i-seem-to-be-using-way-too-much-memory
