advice on Usage of dplyr:: do vs purrr: map, tidy::nest, for predictions

I just came across the the purrr package and I think this would help me out a bit in terms of what I want to do - I just can't put it together.

I think this is going to be along post but goes over a common use case I think many others run into so hopefully this is of use to them as well.

This is what I'm aiming for:

From one big dataset run multiple models on each of the different subgroups.
Have these models readily available so I can examine - for coeffients, accuracy, etc.
From this saved model list for each of the different groupings, be able to apply the corresponding model to the corresponding test-set group.

grouping_vals = c("cyl", "vs")

library(purrr)
library(dplyr)
set.seed(1)
train=mtcars
noise = sample(1:5,32, replace=TRUE) 
test = mtcars %>% mutate( hp = hp * noise) # just so dataset isn't identical


models = train %>% 
group_by_(grouping_vals) %>%
do(linear_model1 = lm(mpg ~hp, data=.),
   linear_model2 = lm(mpg ~., data=.)
)

I've gotten this far but I don't know how to 'map' the corresponding models to the "test" dataset for the corresponding grouped values.
Now I also might be trying to get the residuals from the training of the linear_model1 or linear_model2 with the training-data for the corresponding groups.

models$linear_model1[[2]]$residuals will show me the residuals for the 2nd grouping of model1. I just don't know how move say all of models$linear_model1 $residuals over to the train dataset.

My understanding is that tidyr's nest() function is doing the same thing that occurs when I create my do() create of the models.

   models_with_nest =  train %>% 
     group_by_(grouping_vals) %>%
                   nest() %>%
     mutate( linear_model2 = purrr::map(data, ~lm(mpg~., data=.)),
             linear_model1 = purrr::map(data, ~lm(mpg~ hp+disp, data=.))
     )

Again just look for a way to easily be able to 'map' these residuals/training predictions to the training dataset and apply then apply the corresponding model to an unseen test dataset like the one I created above.

I hope this isn't confusing since I see a lot of promise here I just can't figure out how to put it together.

I figure this is a task that a ton of people would like to be able to do in this more 'automated' way but instead is something that people do very slowly and step by step.

I'm really interested in finding out differences between the do and the nest, map approaches. Maybe people have tried both and they can comment in which is faster when dealing with much bigger datasets, or much more models.

So far I've been using the do approach as follows:

library(tidyverse)

# reproducible results
set.seed(47)

# shuffle / randomise rows
mtcars2 = mtcars %>% sample_frac(1)

# split train / test
mtcars_train = mtcars2[1:20,]
mtcars_test = mtcars2[21:32,]

# for each cyl group create subsets and fit the models of interest using do
dt_models = mtcars_train %>%
  group_by(cyl) %>%
  do(model1 = lm(disp ~ hp, data = .),
     model2 = lm(disp ~ mpg, data = .)) %>%
  ungroup %>%
  print()

# reshape model dataset (for easier use later)
dt_models = dt_models %>% gather("name","model", -cyl) %>% print()

# function to pick model and predict corresponding data (row)
GetModelAndPredict = function(input_cyl, model_name, dd){

  m = (dt_models %>% filter(cyl==input_cyl & name==model_name))$model[[1]]

  predict.lm(m, newdata=dd)

}

# predict each row using the corresponding model
mtcars_test %>%
  rowwise() %>%
  do(data.frame(.,
                pred1 = GetModelAndPredict(.$cyl, "model1", .),
                pred2 = GetModelAndPredict(.$cyl, "model2", .))) %>%
  ungroup


# # A tibble: 12 × 13
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     pred1     pred2
# *  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
# 1   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1 103.11501 115.24903
# 2   17.3     8 275.8   180  3.07 3.730 17.60     0     0     3     3 356.19839 316.20091
# 3   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1 200.10912 151.56750
# 4   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4 195.69767 198.89904
# 5   32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1  87.99347  77.54320
# 6   26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2 101.99490 102.68042
# 7   15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4 365.97745 339.57501
# 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2  85.75324 108.96473
# 9   27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1  87.99347  97.57442
# 10  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1  87.43341  71.65166
# 11  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2 104.23513 115.24903
# 12  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2 355.61630 294.38507

But I found really interesting the nest, map approach as well:

library(tidyverse)

# reproducible results
set.seed(47)

# shuffle / randomise rows
mtcars2 = mtcars %>% sample_frac(1)

# split train / test
mtcars_train = mtcars2[1:20,]
mtcars_test = mtcars2[21:32,]

# for each cyl group create subsets and fit the models of interest using map
dt_models = mtcars_train %>%
  nest(-cyl) %>%
  mutate(model1 = map(data, ~lm(disp ~ hp, data = .)),
         model2 = map(data, ~lm(disp ~ mpg, data = .))) %>%
  rename(data_train = data) %>%
  print()

# join test data to be able to predict them
dt_models_and_test_data = mtcars_test %>%
  nest(-cyl) %>%
  inner_join(dt_models, by = "cyl") %>%
  rename(data_test = data) %>%
  print()

# predict test data using map2
dt_preds = dt_models_and_test_data %>%
  mutate(pred1 = map2(model1, data_test, predict.lm),
         pred2 = map2(model2, data_test, predict.lm)) %>%
  print()

# go back to a reasonable data frame using unnest on columns of interest
dt_preds_upd = dt_preds %>%
  unnest(data_test,pred1,pred2) %>%
  print()


# # A tibble: 12 × 13
#      cyl     pred1     pred2   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl>     <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1      4 103.11501 115.24903  22.8 108.0    93  3.85 2.320 18.61     1     1     4     1
# 2      4  87.99347  77.54320  32.4  78.7    66  4.08 2.200 19.47     1     1     4     1
# 3      4 101.99490 102.68042  26.0 120.3    91  4.43 2.140 16.70     0     1     5     2
# 4      4  85.75324 108.96473  24.4 146.7    62  3.69 3.190 20.00     1     0     4     2
# 5      4  87.99347  97.57442  27.3  79.0    66  4.08 1.935 18.90     1     1     4     1
# 6      4  87.43341  71.65166  33.9  71.1    65  4.22 1.835 19.90     1     1     4     1
# 7      4 104.23513 115.24903  22.8 140.8    95  3.92 3.150 22.90     1     0     4     2
# 8      8 356.19839 316.20091  17.3 275.8   180  3.07 3.730 17.60     0     0     3     3
# 9      8 365.97745 339.57501  15.8 351.0   264  4.22 3.170 14.50     0     1     5     4
# 10     8 355.61630 294.38507  18.7 360.0   175  3.15 3.440 17.02     0     0     3     2
# 11     6 200.10912 151.56750  18.1 225.0   105  2.76 3.460 20.22     1     0     3     1
# 12     6 195.69767 198.89904  21.0 160.0   110  3.90 2.875 17.02     0     1     4     4

来源：https://stackoverflow.com/questions/38621556/advice-on-usage-of-dplyr-do-vs-purrr-map-tidynest-for-predictions

标签

for-loop

dplyr

tidyr

purrr