How to do a Leave One Out cross validation by group / subset?

问题

This question is the second part of a previous question (Linear Regression prediction in R using Leave One out Approach).

I'm trying to build models for each country and generate linear regression predictions using the leave one out approach. In other words, in the code below when building model1 and model2 the "data" used should not be the entire data set. Instead it should be a subset of the dataset (country). Each country data should be evaluated using a model built with data specific to that country.

The code below returns an error. How can I modify/fix the code below to do that? Or is there a better way of doing that?

library(modelr)
install.packages("gapminder")
library(gapminder)                           
data(gapminder) 

#CASE 1
model1 <- lm(lifeExp ~ pop, data = gapminder, subset = country)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder, subset = country)

models <- list(fit_model1 = model1,fit_model2 = model2)

gapminder %>% nest_by(continent, country) %>%
  bind_cols(
    map(1:nrow(gapminder), function(i) {
      map_dfc(models, function(model) {
        training <- data[-i, ] 
        fit <- lm(model, data = training)
        
        validation <- data[i, ]
        predict(fit, newdata = validation)
        
      })
    }) %>%
      bind_rows()
  )

回答1:

The most succinct and straightforward solution would be a nested for loop approach, where the outer loop is the two model formulae and the inner loop is the unity we want to leave out. This can also be done with outer, which I also show afterwards.

For sake of clarity I first show how to leave out one observation (i.e. one row) in each iteration (Part I). I show later how to leave out one cluster (e.g. country) (Part II). I also use the built-in iris data set, which is smaller and thus easier to handle. It contains a "Species" column that is meant to correspond to the "countries" in your data.

Part I

First, we put the two formulae into a list and name them as we would like them to appear in the resulting columns later.

FOAE <- list(fit1=Petal.Length ~ Sepal.Length, 
             fit2=Petal.Length ~ Sepal.Length + Petal.Width)

For the loop, we want to initialize a matrix im whose rows correspond to the number of rows we want to leave out, and columns to the number of model formulae.

im <- matrix(NA, nrow=nrow(iris), ncol=length(FOAE), 
             dimnames=list(NULL, names(FOAE)))

This would look like this:

head(im, n=3)
#      fit1 fit2
# [1,]   NA   NA
# [2,]   NA   NA
# [3,]   NA   NA

Now we loop over formulas and rows as described above.

for (i in seq(FOAE)) {
  for(j in seq(nrow(iris))) {
    train <- iris[-j,]  
    test <- iris[j,]    
    fit <- lm(FOAE[[i]], data=train)
    im[j, i] <- predict(fit, newdata=test)
  }
}

im has now been filled, and we may cbind it to the original iris data set to get our result res1.

res1 <- cbind(iris, im)
head(res1)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# 1          5.1         3.5          1.4         0.2  setosa 2.388501 1.611976
# 2          4.9         3.0          1.4         0.2  setosa 2.014324 1.501389
# 3          4.7         3.2          1.3         0.2  setosa 1.639805 1.392955
# 4          4.6         3.1          1.5         0.2  setosa 1.446175 1.333199
# 5          5.0         3.6          1.4         0.2  setosa 2.201646 1.556620
# 6          5.4         3.9          1.7         0.4  setosa 2.944788 2.127184

To alternatively follow the outer approach, we put the code inside the for loop into a formula which we Vectorize so that it can handle matrix columns (i.e. vectors).

FUN1 <- Vectorize(function(x, y) {
  train <- iris[-x,]
  test <- iris[x,]
  fit <- lm(y, data=train)
  predict(fit, newdata=test)
})

Now we put FOAE and the rows 1:nrow(iris) to leave out subsequently, together with FUN1 into outer(). This already gives us the result that we can cbind to iris in the same way as above to get our result res2.

o1 <- outer(FOAE, 1:nrow(iris), FUN1)
res2 <- cbind(iris, o1)

head(res2)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# 1          5.1         3.5          1.4         0.2  setosa 2.388501 1.611976
# 2          4.9         3.0          1.4         0.2  setosa 2.014324 1.501389
# 3          4.7         3.2          1.3         0.2  setosa 1.639805 1.392955
# 4          4.6         3.1          1.5         0.2  setosa 1.446175 1.333199
# 5          5.0         3.6          1.4         0.2  setosa 2.201646 1.556620
# 6          5.4         3.9          1.7         0.4  setosa 2.944788 2.127184

## test if results are different is negative 
stopifnot(all.equal(res1, res2))

Part II

We may follow a similar approach when leaving out a cluster (i.e. species or countries). I show here the outer method. The thing we want to change is that we now want to leave out observations belonging to a specific cluster, here "Species" (in your case "countries"), which unique values we put into a vector Species.u . Since the values are in "character" or "factor" format we subset the data using data[!data$cluster %in% x, ] instead of data[-x, ]. Because predict would yield multiple values in the clusters, but we want the same value in the respective clusters, we might want to use a statistic, e.g. the mean prediction of each cluster. We use rownames according to the cluster.

FUN2 <- Vectorize(function(x, y) {
  train <- iris[!iris$Species %in% x,]
  test <- iris[iris$Species %in% x,]
  fit <- lm(y, data=train)
  mean(predict(fit, newdata=test))
})
Species.u <- unique(iris$Species)

o2 <- `rownames<-`(outer(Species.u, FOAE, FUN2), Species.u)

This now gives us a matrix which is smaller than our data set. Thanks to the rownames we may match the predictions tho the clusters to which they belong.

o2
#                fit1     fit2
# setosa     3.609943 2.662609
# versicolor 3.785760 3.909919
# virginica  4.911009 5.976922

res3 <- cbind(iris, o2[match(iris$Species, rownames(o2)), ])

head(res3)
#          Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa            5.1         3.5          1.4         0.2  setosa 3.609943 2.662609
# setosa.1          4.9         3.0          1.4         0.2  setosa 3.609943 2.662609
# setosa.2          4.7         3.2          1.3         0.2  setosa 3.609943 2.662609
# setosa.3          4.6         3.1          1.5         0.2  setosa 3.609943 2.662609
# setosa.4          5.0         3.6          1.4         0.2  setosa 3.609943 2.662609
# setosa.5          5.4         3.9          1.7         0.4  setosa 3.609943 2.662609

tail(res3)
#              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species     fit1     fit2
# virginica.44          6.7         3.3          5.7         2.5 virginica 4.911009 5.976922
# virginica.45          6.7         3.0          5.2         2.3 virginica 4.911009 5.976922
# virginica.46          6.3         2.5          5.0         1.9 virginica 4.911009 5.976922
# virginica.47          6.5         3.0          5.2         2.0 virginica 4.911009 5.976922
# virginica.48          6.2         3.4          5.4         2.3 virginica 4.911009 5.976922
# virginica.49          5.9         3.0          5.1         1.8 virginica 4.911009 5.976922

Edit

In this version of FUN2, FUN3, the output of the models of each cluster are rbinded (in two columns of course, because of two models).

FUN3 <- Vectorize(function(x, y) {
  train <- iris[!iris$Species %in% x,]
  test <- iris[iris$Species %in% x,]
  fit <- lm(y, data=train)
  (predict(fit, newdata=test))
}, SIMPLIFY=F)
Species.u <- unique(iris$Species)

o3 <- `rownames<-`(outer(Species.u, FOAE, FUN3), Species.u)

res32 <- cbind(iris, apply(o3, 2, unlist))
head(res32)
#          Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa.1          5.1         3.5          1.4         0.2  setosa 3.706940 2.678255
# setosa.2          4.9         3.0          1.4         0.2  setosa 3.500562 2.547587
# setosa.3          4.7         3.2          1.3         0.2  setosa 3.294183 2.416919
# setosa.4          4.6         3.1          1.5         0.2  setosa 3.190994 2.351586
# setosa.5          5.0         3.6          1.4         0.2  setosa 3.603751 2.612921
# setosa.6          5.4         3.9          1.7         0.4  setosa 4.016508 3.073249

Edit 2

As I learned in your comment you want 1. a subset of your data along clusters. This would be ss in FUN4 below. Then the ss is also subsetted by leaving out one row z over the rows of subset ss.

FUN4 <- Vectorize(function(x, y) {
  ## subsets first by cluster then by row
  ss <- iris[iris$Species %in% x,]  ## cluster subset
  sapply(1:nrow(ss), function(z) {  ## subset rows using `sapply`
    train <- ss[-z,]  ## train data w/o row z
    test <- ss[z,]    ## test data for `predict`, just row z
    fit <- lm(y, data=train)
    predict(fit, newdata=test)
  })
}, SIMPLIFY=F)

## the two models
FOAE <- list(fit1=Petal.Length ~ Sepal.Length, 
             fit2=Petal.Length ~ Sepal.Length + Petal.Width)

## unique cluster names
Species.u <- unique(iris$Species)

## with the `outer` we iterate over all the permutations of clusters and models `FOAE`.
o4 <- `rownames<-`(outer(Species.u, FOAE, FUN4), Species.u)

## `unlist`ed result is directly `cbind`able to original data
res4 <- cbind(iris, apply(o4, 2, unlist))

## result
head(res4)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa.1          5.1         3.5          1.4         0.2  setosa 1.476004 1.451029
# setosa.2          4.9         3.0          1.4         0.2  setosa 1.449120 1.431737
# setosa.3          4.7         3.2          1.3         0.2  setosa 1.426185 1.416492
# setosa.4          4.6         3.1          1.5         0.2  setosa 1.404040 1.398103
# setosa.5          5.0         3.6          1.4         0.2  setosa 1.462460 1.441295
# setosa.6          5.4         3.9          1.7         0.4  setosa 1.504990 1.559045

来源：https://stackoverflow.com/questions/65440030/how-to-do-a-leave-one-out-cross-validation-by-group-subset

标签

validation

group-by

regression