lapply() output as a dataframe of multiple functions - R

Question

I have been trying to create a new dataframe from several computations with lapply(). After reading several related questions, I have got this far:

lapply(mtcars, function(x) c(colnames(x), 
                             NROW(unique(x)), 
                             sum(is.na(x)), 
                             round(sum(is.na(x))/NROW(x),2)   
                        )
       )

However, colnames(x) doesn't return the column name, because inside the function x is a plain vector. Second, I can't figure out how to turn this output into a dataframe:

lapply(mtcars, function(x) data.frame(NROW(unique(x)), # if I put colnames(x) here it gives an error
                                      sum(is.na(x)), 
                                      round(sum(is.na(x))/NROW(x),2)   
                        )
       )

As you might see above, the final dataframe should follow a structure like:

| Variable_name | sum_unique | NA_count | NA_percent |

Answer 1:

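The following will work. Iterate over the column indices rather than over the columns themselves, so that the column name can be looked up inside the function. Each iteration returns a one-row data frame, and do.call(rbind, lst) then combines those data frames into the final output.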

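lst <- lapply(seq_along(mtcars), function(i){
  x <- mtcars[[i]]                          # i-th column as a plain vector
  data.frame(
    Variable_name = colnames(mtcars)[[i]],  # name comes from the data frame, not from x
    sum_unique = NROW(unique(x)),           # number of distinct values
    NA_count = sum(is.na(x)),               # number of missing values
    NA_percent = round(sum(is.na(x))/NROW(x),2))  # share of missing values (proportion, 0 to 1)
  })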

do.call(rbind, lst)
#    Variable_name sum_unique NA_count NA_percent
# 1            mpg         25        0          0
# 2            cyl          3        0          0
# 3           disp         27        0          0
# 4             hp         22        0          0
# 5           drat         22        0          0
# 6             wt         29        0          0
# 7           qsec         30        0          0
# 8             vs          2        0          0
# 9             am          2        0          0
# 10          gear          3        0          0
# 11          carb          6        0          0
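
To see why colnames(x) came back empty in the question, it can help to check what lapply() actually hands to the function. The sketch below is one way to look at it using only base R. summarise_column is a hypothetical helper name (not from the original post), and it iterates over names(mtcars) instead of column indices.

```r
# lapply() over a data frame passes each column as a bare vector,
# so there are no column names to read inside the function.
x <- mtcars[[1]]
is.null(colnames(x))   # a plain numeric vector has no dimnames

# Hypothetical helper (the name is an assumption): takes a column *name*
# and looks the column up in the data frame.
summarise_column <- function(nm, df) {
  col <- df[[nm]]
  data.frame(
    Variable_name = nm,
    sum_unique    = length(unique(col)),
    NA_count      = sum(is.na(col)),
    NA_percent    = round(mean(is.na(col)), 2),  # same as sum(is.na(col)) / length(col)
    stringsAsFactors = FALSE
  )
}

# One one-row data frame per column name, stacked into a single data frame.
result <- do.call(rbind, lapply(names(mtcars), summarise_column, df = mtcars))
```

Passing the name rather than the index keeps the helper usable on any data frame, since it does not refer to mtcars directly.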

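Since you tagged this post with tidyverse, here is an alternative that uses map_dfr and is more concise. The .id argument stores the name of each list element (here, the column name) in a new Variable_name column, so there is no need to look the name up by index.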

library(tidyverse)

map_dfr(mtcars, function(x){
  tibble(sum_unique = NROW(unique(x)), 
         NA_count = sum(is.na(x)), 
         NA_percent = round(sum(is.na(x))/NROW(x),2))
}, .id = "Variable_name")
# # A tibble: 11 x 4
#    Variable_name sum_unique NA_count NA_percent
#    <chr>              <int>    <int>      <dbl>
#  1 mpg                   25        0          0
#  2 cyl                    3        0          0
#  3 disp                  27        0          0
#  4 hp                    22        0          0
#  5 drat                  22        0          0
#  6 wt                    29        0          0
#  7 qsec                  30        0          0
#  8 vs                     2        0          0
#  9 am                     2        0          0
# 10 gear                   3        0          0
# 11 carb                   6        0          0
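
In purrr 1.0.0 and later, map_dfr() is marked as superseded in favour of map() followed by list_rbind(). Assuming such a version is installed, the same table could be built roughly as follows. col_summary and res are illustrative names, not part of the original answer.

```r
library(purrr)   # assumes purrr >= 1.0.0, which provides list_rbind()
library(tibble)

# Illustrative helper: a one-row tibble of summary statistics for one vector.
col_summary <- function(x) {
  tibble(sum_unique = length(unique(x)),
         NA_count   = sum(is.na(x)),
         NA_percent = round(mean(is.na(x)), 2))
}

# map() returns a named list of one-row tibbles; list_rbind() stacks them,
# and names_to stores each element's name in a new column.
res <- list_rbind(map(mtcars, col_summary), names_to = "Variable_name")
```

Here names_to plays the role that .id plays in map_dfr().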

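Finally, another solution using functions from dplyr and tidyr. It computes every statistic for every column with summarize_all, which yields columns named like mpg_sum_unique, then reshapes to long format, splits each name at the first underscore into the variable name and the statistic (extra = "merge" keeps names such as sum_unique in one piece), and pivots back to wide format.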

mtcars %>%
  summarize_all(
    list(
      sum_unique = function(x) NROW(unique(x)), 
      NA_count = function(x) sum(is.na(x)), 
      NA_percent = function(x) round(sum(is.na(x))/NROW(x),2)
    )
  ) %>%
  pivot_longer(everything(), 
               names_to = "column", 
               values_to = "value") %>%
  separate(column, into = c("Variable_name", "parameter"), sep = "_", extra = "merge") %>%
  pivot_wider(names_from = "parameter", values_from = "value")
# # A tibble: 11 x 4
#    Variable_name sum_unique NA_count NA_percent
#    <chr>              <dbl>    <dbl>      <dbl>
#  1 mpg                   25        0          0
#  2 cyl                    3        0          0
#  3 disp                  27        0          0
#  4 hp                    22        0          0
#  5 drat                  22        0          0
#  6 wt                    29        0          0
#  7 qsec                  30        0          0
#  8 vs                     2        0          0
#  9 am                     2        0          0
# 10 gear                   3        0          0
# 11 carb                   6        0          0
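
A shorter variant of the same idea is sketched below, assuming dplyr 1.0.0 or later (for across()) and tidyr 1.0.0 or later (for the ".value" sentinel in pivot_longer()). It relies on across() naming its outputs as column_function by default and on the mtcars column names containing no underscores. res is an illustrative name.

```r
library(dplyr)   # assumes dplyr >= 1.0.0 for across()
library(tidyr)   # assumes tidyr >= 1.0.0 for pivot_longer() with ".value"

res <- mtcars %>%
  summarise(across(
    everything(),
    list(sum_unique = function(x) length(unique(x)),
         NA_count   = function(x) sum(is.na(x)),
         NA_percent = function(x) round(mean(is.na(x)), 2))
  )) %>%
  # Columns are now named like "mpg_sum_unique". The text before the first
  # underscore becomes Variable_name; the rest names an output column.
  pivot_longer(everything(),
               names_to = c("Variable_name", ".value"),
               names_pattern = "^([^_]+)_(.+)$")
```

Because ".value" sends each statistic straight to its own column, the counts never pass through a shared value column and so keep their integer type, and no separate pivot_wider() step is needed.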

Source: https://stackoverflow.com/questions/58496094/lapply-output-as-a-dataframe-of-multiple-functions-r

Tags

dplyr

apply

tidyverse

lapply