R memory management advice (caret, model matrices, data frames)

醉酒成梦 2020-12-16 05:48

I'm running out of memory on a normal 8GB server while working with a fairly small dataset in a machine learning context:

> dim(basetrainf) # this is a dataframe
[1] 5         


        
3 Answers
  • 2020-12-16 06:12

    Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.

    Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can't fit one directly, you won't be able to use caret to fit many in one go.

    At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting so it doesn't balloon out of control. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest I mentioned earlier, but also the variable importance measures). Once you've determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.
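
    A minimal sketch of fitting one forest by hand with the memory-hungry extras switched off (here X and y are placeholders for your predictor data frame and response factor):

        library(randomForest)

        ## Fit a single forest for one candidate mtry value, dropping the parts
        ## of the returned object that take the most memory: the stored trees
        ## (keep.forest = FALSE) and the variable importance measures
        ## (importance = FALSE). X and y are placeholders.
        rf <- randomForest(x = X, y = y,
                           mtry = 7, ntree = 500,
                           keep.forest = FALSE,
                           importance  = FALSE)
        print(object.size(rf), units = "Mb")  # check the footprint of the fit

    With keep.forest = FALSE you cannot predict on new data afterwards, but the OOB error is still reported, which is enough for comparing candidate mtry values.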

  • 2020-12-16 06:13

    You can try to use the ff package, which implements "memory-efficient storage of large data on disk and fast access functions".
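
    A minimal sketch, assuming the data sit in a CSV file (the file name big.csv is a placeholder):

        library(ff)

        ## read.csv.ffdf reads the file in chunks and stores the columns on
        ## disk as ff objects, so only small pieces are held in RAM at a time.
        bigdf <- read.csv.ffdf(file = "big.csv", header = TRUE,
                               next.rows = 50000)   # chunk size while reading
        dim(bigdf)                # dimensions without loading the whole table
        chunk <- bigdf[1:1000, ]  # subsetting materialises only these rows

    Most modelling functions still expect an ordinary in-memory data frame, so in practice you pull a manageable subset of rows back into RAM before handing it to randomForest or caret.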

  • 2020-12-16 06:20

    With that much data, the resampled error estimates and the random forest OOB error estimates should be pretty close. Try using trainControl(method = "oob") and train() will not fit the extra models on resampled data sets.

    Also, avoid the formula interface like the plague; it expands a full model matrix and keeps extra copies of the data.

    You might also try bagging instead. Since there is no random selection of predictors at each split, you can get good results with 50-100 resamples (instead of the many more that random forests need to be effective).

    Others may disagree, but I also think that modeling all the data you have is not always the best approach. Unless the predictor space is large, many of the data points will be very similar to others and contribute little to the model fit beyond additional computational cost and a larger fitted object. caret has a function called maxDissim that might be helpful for thinning the data (although it is not terribly efficient either).
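
    A minimal sketch of the OOB and non-formula suggestions, plus maxDissim (predictors and outcome are placeholder objects, and maxDissim assumes numeric predictors it can compute distances on):

        library(caret)

        ## OOB error estimates instead of resampling, and the x/y interface
        ## instead of a formula, so no model matrix copies are created.
        ctrl <- trainControl(method = "oob")
        fit  <- train(x = predictors, y = outcome,
                      method = "rf",
                      trControl = ctrl,
                      tuneLength = 3)

        ## Thin near-duplicate rows: start from a small random seed set and add
        ## the 500 most dissimilar rows from the remainder (numbers arbitrary).
        start   <- sample(seq_len(nrow(predictors)), 50)
        pool    <- predictors[-start, ]
        picked  <- maxDissim(predictors[start, ], pool, n = 500)  # indices into pool
        thinned <- rbind(predictors[start, ], pool[picked, ])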
