Question
I have been trying to create a new dataframe from several computations with lapply(). Here is what I have so far, after reading several questions (1, 2, 3):
lapply(mtcars, function(x) c(colnames(x),
                             NROW(unique(x)),
                             sum(is.na(x)),
                             round(sum(is.na(x))/NROW(x),2)
                             )
       )
However, colnames(x) doesn't return the column name, because x is a vector. Second, I can't figure out a way to transform this output into a dataframe:
lapply(mtcars, function(x) data.frame(NROW(unique(x)), # if I put colnames(x) here it gives an error
                                      sum(is.na(x)),
                                      round(sum(is.na(x))/NROW(x),2)
                                      )
       )
As you might see above, the final dataframe should follow a structure like:
| Variable_name | sum_unique | NA_count | NA_percent |
Answer 1:
The following will work. First, create a list with each element as a data frame, and then combine all data frames to get the final output.
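lst <- lapply(1:ncol(mtcars), function(i){
  # Look up the column by position so its name is still available
  x <- mtcars[[i]]
  data.frame(
    Variable_name = colnames(mtcars)[[i]],
    sum_unique = NROW(unique(x)),
    NA_count = sum(is.na(x)),
    NA_percent = round(sum(is.na(x))/NROW(x),2))
})
# Stack the one-row data frames into a single data frame
do.call(rbind, lst)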
# Variable_name sum_unique NA_count NA_percent
# 1 mpg 25 0 0
# 2 cyl 3 0 0
# 3 disp 27 0 0
# 4 hp 22 0 0
# 5 drat 22 0 0
# 6 wt 29 0 0
# 7 qsec 30 0 0
# 8 vs 2 0 0
# 9 am 2 0 0
# 10 gear 3 0 0
# 11 carb 6 0 0
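As a side note on the first problem in the question: lapply() hands each column to the function as a bare vector, so colnames(x) is NULL there. The following is a minimal sketch of an alternative that passes the names alongside the columns with Map(). The helper name summarise_column is made up for illustration and is not part of the original answer.

```r
# Inside lapply() over a data frame, each x is a bare column vector,
# so it carries no column name.
is.null(colnames(mtcars$mpg))  # TRUE

# Hypothetical helper: summarise one column, given its values and its name.
summarise_column <- function(x, name) {
  data.frame(
    Variable_name = name,
    sum_unique    = NROW(unique(x)),
    NA_count      = sum(is.na(x)),
    NA_percent    = round(sum(is.na(x)) / NROW(x), 2)
  )
}

# Map() walks the columns and their names in parallel and returns a list
# of one-row data frames, which rbind then stacks.
result <- do.call(rbind, Map(summarise_column, mtcars, names(mtcars)))
rownames(result) <- NULL  # drop the row names inherited from the list names
result
```

This should give the same table as the index-based lapply() version, without indexing by position.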
Since you tagged this post with tidyverse, here is an alternative that uses map_dfr, which makes the code more concise.
library(tidyverse)
map_dfr(mtcars, function(x){
  tibble(sum_unique = NROW(unique(x)),
         NA_count = sum(is.na(x)),
         NA_percent = round(sum(is.na(x))/NROW(x),2))
}, .id = "Variable_name")
# # A tibble: 11 x 4
# Variable_name sum_unique NA_count NA_percent
# <chr> <int> <int> <dbl>
# 1 mpg 25 0 0
# 2 cyl 3 0 0
# 3 disp 27 0 0
# 4 hp 22 0 0
# 5 drat 22 0 0
# 6 wt 29 0 0
# 7 qsec 30 0 0
# 8 vs 2 0 0
# 9 am 2 0 0
# 10 gear 3 0 0
# 11 carb 6 0 0
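Because mtcars has no missing values, the NA columns above are all zero, so the output does not really exercise the NA logic. Here is a small sanity check in base R on a made-up data frame (toy and its contents are assumptions for illustration only). Note that unique() keeps NA as a value of its own, so sum_unique counts it.

```r
# Made-up data with missing values; not part of the original question.
toy <- data.frame(a = c(1, 1, NA, 4),
                  b = c("x", NA, NA, "y"))

stats <- do.call(rbind, lapply(names(toy), function(nm) {
  x <- toy[[nm]]
  data.frame(
    Variable_name = nm,
    sum_unique    = NROW(unique(x)),  # NA counts as one distinct value
    NA_count      = sum(is.na(x)),
    NA_percent    = round(sum(is.na(x)) / NROW(x), 2)
  )
}))
stats
```

If NA should not count as a distinct value, one option is to apply unique() to x[!is.na(x)] instead.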
Finally, here is another solution using functions from dplyr and tidyr.
mtcars %>%
  summarize_all(
    list(
      sum_unique = function(x) NROW(unique(x)),
      NA_count = function(x) sum(is.na(x)),
      NA_percent = function(x) round(sum(is.na(x))/NROW(x),2)
    )
  ) %>%
  pivot_longer(everything(),
               names_to = "column",
               values_to = "value") %>%
  separate(column, into = c("Variable_name", "parameter"), sep = "_", extra = "merge") %>%
  pivot_wider(names_from = "parameter", values_from = "value")
# # A tibble: 11 x 4
# Variable_name sum_unique NA_count NA_percent
# <chr> <dbl> <dbl> <dbl>
# 1 mpg 25 0 0
# 2 cyl 3 0 0
# 3 disp 27 0 0
# 4 hp 22 0 0
# 5 drat 22 0 0
# 6 wt 29 0 0
# 7 qsec 30 0 0
# 8 vs 2 0 0
# 9 am 2 0 0
# 10 gear 3 0 0
# 11 carb 6 0 0
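One caveat with the pipeline above: separate() with sep = "_" and extra = "merge" splits at the first underscore, so a column such as my_col would be broken into "my" and "col_sum_unique". The sketch below is one possible way around this, assuming dplyr 1.0 or later (where across() supersedes summarize_all()) and a tidyr version with pivot_longer(). The data frame toy and the "__" separator are choices made for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Made-up data with an underscore in a column name.
toy <- data.frame(my_col = c(1, 1, NA, 4),
                  other  = c("x", NA, NA, "y"))

result <- toy %>%
  summarise(across(
    everything(),
    list(
      sum_unique = function(x) NROW(unique(x)),
      NA_count   = function(x) sum(is.na(x)),
      NA_percent = function(x) round(sum(is.na(x)) / NROW(x), 2)
    ),
    # Join column and statistic with a separator that is unlikely
    # to occur in real column names.
    .names = "{.col}__{.fn}"
  )) %>%
  # ".value" turns the statistic part of each name into its own column,
  # so no separate() / pivot_wider() step is needed.
  pivot_longer(
    everything(),
    names_to  = c("Variable_name", ".value"),
    names_sep = "__"
  )
result
```

Since the statistics never share a single value column here, each one keeps its own type.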
Source: https://stackoverflow.com/questions/58496094/lapply-output-as-a-dataframe-of-multiple-functions-r