Modeling a very big data set (1.8 Million rows x 270 Columns) in R

Question


I am working on Windows 8 with 8 GB of RAM. I have a data.frame of 1.8 million rows x 270 columns on which I have to fit a glm (logit or any other classification model).

I've tried using the ff and bigglm packages for handling the data.

But I am still facing the error "Error: cannot allocate vector of size 81.5 Gb". So I decreased the number of rows to 10 and tried the bigglm steps on an object of class ffdf. However, the error still persists.

Can anyone suggest a solution to the problem of building a classification model with this many rows and columns?
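
For reference, a minimal sketch of how data this size can be brought into an ffdf chunk by chunk, so that only one chunk is ever in RAM (the file name and chunk size below are placeholders, not from the original question):

require(ff)
require(ffbase)
# read the raw CSV chunkwise into an ffdf; only next.rows rows are in RAM at a time
x <- read.csv.ffdf(file = "bigdata.csv", header = TRUE, next.rows = 100000)
class(x)  # "ffdf"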

**EDITS**:

I am not using any other program while I run the code. The RAM on the system is 60% free before I run the code, and that usage is due to the R program; when I terminate R, the RAM is 80% free.
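
For reference, a few base-R checks that help diagnose such allocation failures (memory.limit is Windows-only):

memory.limit()                        # maximum memory R may use on Windows, in MB
print(object.size(x), units = "Gb")   # in-RAM footprint of the data.frame
gc()                                  # force a garbage collection and report usage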

As suggested by the commenters, I am adding some of the columns I am working with, for reproducibility. OPEN_FLG is the dependent variable (DV) and the others are independent variables (IDVs).

str(x[1:10,])
'data.frame':   10 obs. of  270 variables:
 $ OPEN_FLG                   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1    
 $ new_list_id                : Factor w/ 9 levels "0","3","5","6",..: 1 1 1 1 1 1 1 1 1 1    
 $ new_mailing_id             : Factor w/ 85 levels "1398","1407",..: 1 1 1 1 1 1 1 1 1 1    
 $ NUM_OF_ADULTS_IN_HHLD      : num  3 2 6 3 3 3 3 6 4 4    
 $ NUMBER_OF_CHLDRN_18_OR_LESS: Factor w/ 9 levels "","0","1","2",..: 2 2 4 7 3 5 3 4 2 5    
 $ OCCUP_DETAIL               : Factor w/ 49 levels "","00","01","02",..: 2 2 2 2 2 2 2 21 2 2    
 $ OCCUP_MIX_PCT              : num  0 0 0 0 0 0 0 0 0 0    
 $ PCT_CHLDRN                 : int  28 37 32 23 36 18 40 22 45 21   
 $ PCT_DEROG_TRADES           : num  41.9 38 62.8 2.9 16.9 ...    
 $ PCT_HOUSEHOLDS_BLACK       : int  6 71 2 1 0 4 3 61 0 13    
 $ PCT_OWNER_OCCUPIED         : int  91 66 63 38 86 16 79 19 93 22    
 $ PCT_RENTER_OCCUPIED        : int  8 34 36 61 14 83 20 80 7 77    
 $ PCT_TRADES_NOT_DEROG       : num  53.7 55 22.2 92.3 75.9 ...    
 $ PCT_WHITE                  : int  69 28 94 84 96 79 91 29 97 79    
 $ POSTAL_CD                  : Factor w/ 104568 levels "010011203","010011630",..: 23789 45173 32818 6260 88326 29954 28846 28998 52062 47577    
 $ PRES_OF_CHLDRN_0_3         : Factor w/ 4 levels "","N","U","Y": 2 2 3 4 2 4 2 4 2 4    
 $ PRES_OF_CHLDRN_10_12       : Factor w/ 4 levels "","N","U","Y": 2 2 4 3 3 2 3 2 2 3    
 [list output truncated]

And this is an example of the code I am using.

require(biglm)
# first attempt: fit directly on the data.frame x
# (no family argument, so bigglm defaults to gaussian())
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)

require(ff)
require(ffbase)  # ffseq_len and expand.ffgrid are exported by ffbase, not ff
x$id <- ffseq_len(nrow(x))              # add a row id to the ffdf
xex <- expand.ffgrid(x$id, ff(1:100))   # cross join: each id repeated 100 times
colnames(xex) <- c("id", "explosion.nr")
# left join the original columns back, giving 100 copies of every row
xex <- merge(xex, x, by.x = "id", by.y = "id", all.x = TRUE, all.y = FALSE)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = xex)

The problem is that both times I get the same error: "Error: cannot allocate vector of size 81.5 Gb".


Please let me know if this is enough, or whether I should include any more details about the problem.


Answer 1:


I have the impression you are not using ffbase::bigglm.ffdf, but you want to be. Namely, the following will put all your data in RAM and will use biglm::bigglm.function, which is not what you want:

require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
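
As a quick sanity check (a small addition, not from the original answer): bigglm dispatches on the class of its data argument, so you can confirm what you are actually passing and which methods are available:

class(x)          # should be "ffdf"; a plain "data.frame" routes to the in-RAM method
methods(bigglm)   # with ffbase loaded, this lists bigglm.ffdf among the methods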

You need to use ffbase::bigglm.ffdf, which works chunkwise on an ffdf. So load the ffbase package, which exports bigglm.ffdf. With ffbase loaded, you can do the following:

require(ffbase)
# keep only the columns used in the model
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
# recode the factor response to logical for the binomial family
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
# chunkwise logistic regression, dispatched to bigglm.ffdf
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())

Explanation: because you don't limit yourself to the columns used in the model, all the columns of your xex ffdf would be pulled into RAM, which is not needed. Also, you were using a gaussian model on a factor response, which is bizarre; I believe you were trying to do a logistic regression, so use the appropriate family argument. With these changes it will use ffbase::bigglm.ffdf and not biglm::bigglm.function.

If that does not work (which I doubt), it is because you have other things in RAM which you are not aware of. In that case, save the reduced dataset to disk, restart R, and reload it:

require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
ffsave(mymodeldataset, file = "mymodeldataset")  # write the ffdf to disk

## Open R again (a fresh session with an empty workspace)
require(ffbase)
require(biglm)
ffload("mymodeldataset")  # reload only the saved ffdf
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())

And off you go.
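
Once it has run, the usual accessors work on the fitted bigglm object, for example:

summary(mymodel)   # coefficient table of the chunkwise logistic fit
coef(mymodel)      # the coefficient vector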



Source: https://stackoverflow.com/questions/17295423/modeling-a-very-big-data-set-1-8-million-rows-x-270-columns-in-r
