How to reference a column in a nested dataframe (then use purrr::map)

Posted by 混江龙づ霸主 on 2020-06-27 04:14:05

Question


I have a very simple question about referencing data columns within a nested dataframe.

For a reproducible example, I'll nest mtcars by the two values of variable am:

library(tidyverse)
mtcars_nested <- mtcars %>% 
  group_by(am) %>% 
  nest()
mtcars_nested

which gives data that looks like this.

#> # A tibble: 2 x 2
#> # Groups:   am [2]
#>      am data              
#>   <dbl> <list>            
#> 1     1 <tibble [13 × 10]>
#> 2     0 <tibble [19 × 10]>

Suppose I now want to use purrr::map to take the mean of mpg for each level of am. Why doesn't this work?


take_mean_mpg <- function(df){
  mean(df[["data"]]$mpg)
}

map(mtcars_nested, take_mean_mpg)
#> Error in df[["data"]] : subscript out of bounds

Or maybe a simpler question is: how should I properly reference the mpg column once it's nested? I know that this doesn't work:

mtcars_nested[["data"]]$mpg

Answer 1:


Dataframes (and tbls) are lists of columns, not lists of rows, so when you pass the whole tbl mtcars_nested to map(), it iterates over the columns, not over the rows. You can use mutate() with your function, and map_dbl() so that your new column is not a list column.

library(tidyverse)
mtcars_nested <- mtcars %>% 
  group_by(am) %>% 
  nest()
mtcars_nested

take_mean_mpg <- function(df){
  mean(df$mpg)
}

mtcars_nested %>%
  mutate(mean_mpg = map_dbl(.data[["data"]], take_mean_mpg))

The .data[["data"]] argument to map_dbl() gives it the data list column from your dataframe to iterate over, rather than the entire dataframe. The .data part of the argument has no relation to your column named "data"; it is the rlang pronoun .data, which refers to your whole dataframe. [["data"]] then retrieves the column named "data" from it. You use mutate() because you are trying (I assumed, perhaps incorrectly) to add a column with the averages to the nested dataframe. mutate() is used to add columns, so you add a column equal to the output of map() (or map_dbl()) called with your function, which returns the list (or vector) of averages.
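For comparison, here is a minimal sketch of the same step without the .data pronoun. It assumes dplyr's usual data masking, under which the bare name data inside mutate() resolves to the list column. The name result and the trailing ungroup() are illustrative additions, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

take_mean_mpg <- function(df) {
  mean(df$mpg)
}

# Inside mutate(), the bare name `data` refers to the list column,
# so map_dbl() receives one nested tibble per element.
result <- mtcars_nested %>%
  mutate(mean_mpg = map_dbl(data, take_mean_mpg)) %>%
  ungroup()  # drop the grouping left over from group_by(am)

result
```

Either spelling should hand map_dbl() the same list column; the .data form is simply more explicit about where the column comes from.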

This can be a confusing concept. Although map() is often used to iterate over the rows of a dataframe, it technically iterates over a list (see the documentation, which says under Arguments):

.x A list or atomic vector.

It also returns a list or a vector. The good news is that columns are just lists of values, so you pass it the list (column) you want it to iterate over and assign it to the list (column) where you want it stored (this assignment happens with mutate()).
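The point about columns versus rows can be checked directly. The sketch below rebuilds the nested tibble from the question. The names col_is_list, n_elements and first_mean are hypothetical and introduced only for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

# Mapping over the tibble itself visits each column in turn:
# first `am` (an atomic vector), then `data` (a list column).
col_is_list <- map_lgl(mtcars_nested, is.list)
col_is_list

# The list column holds one nested tibble per row of mtcars_nested.
n_elements <- length(mtcars_nested$data)

# A single nested tibble is reached with [[ ]]; only then does $mpg work.
first_mean <- mean(mtcars_nested$data[[1]]$mpg)
first_mean
```

This also shows why mtcars_nested[["data"]]$mpg from the question does not work: [["data"]] returns the list of tibbles, and that list has no element named mpg.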




Answer 2:


Pass the list column mtcars_nested$data to map(), and take the mean of the mpg column inside the function.

take_mean_mpg <- function(df){
     mean(df$mpg)
}

purrr::map(mtcars_nested$data, take_mean_mpg)
#[[1]]
#[1] 24.39231

#[[2]]
#[1] 17.14737
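If a plain numeric vector is more convenient than a list, one possible variation is map_dbl() with an anonymous function. This is a sketch rather than part of the original answer. The name means and the labelling by am are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

# map_dbl() returns a double vector instead of a list
means <- map_dbl(mtcars_nested$data, ~ mean(.x$mpg))

# Label each mean with the am value of its row
names(means) <- mtcars_nested$am
means
```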


Source: https://stackoverflow.com/questions/62408438/how-to-reference-a-column-in-a-nested-dataframe-then-use-purrrmap
