问题
I am trying to make a table of all my numerical variables (i.e. feature) in the following format:
Feature | Count | % Missing | Cardinality | Min. | 1st Quartile | Mean | Median | 3rd Quartile | Max. | Std. Dev. |
--------|-------|-----------|-------------|------|--------------|------|--------|--------------|------|-----------| | | | | | | | | | | |
So each row signifies a specific numeric variable and each column the statistics shown above (Count, % Missing, Cardinality, Min., 1st Quartile, Mean, Median, 3rd Quartile, Max. Std. Dev.)
Say my dataset is called Mashable and my numerical variables are called X, Y and Z. How would I create this table?
Thanks in advance!
回答1:
If you're using dplyr
already, you can make use of long shaped data and grouping, and treat all the functions you need as summarizations. That lets you scale easily, so it's the same workflow for 3 variables as it is for 25 or 100. It also makes it relatively quick to apply whatever functions you want.
I made dummy data with x, y, and z, then bound onto it a couple rows of NA
s just to show the missing value count. Gather it to long data, group by the variable, then use whatever summary functions you want. I started out the first several you named. This gives you the format you asked for.
library(tidyverse)
tibble(
x = rnorm(100, mean = 1, sd = 1),
y = rnorm(100, mean = 10, sd = 1),
z = rexp(100, rate = 0.01)
) %>%
bind_rows(tibble(x = c(NA, NA), y = c(NA, NA), z = c(NA, NA))) %>%
gather(key = variable, value = value) %>%
group_by(variable) %>%
summarise(
count = n(),
missing = sum(is.na(value)),
share_missing = missing / count,
mean = mean(value, na.rm = T),
sd = sd(value, na.rm = T),
q1 = quantile(value, 0.25, na.rm = T)
)
#> # A tibble: 3 x 7
#> variable count missing share_missing mean sd q1
#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 x 102 2 0.0196 0.997 1.08 0.246
#> 2 y 102 2 0.0196 9.81 0.962 9.10
#> 3 z 102 2 0.0196 106. 90.6 39.9
Created on 2018-05-20 by the reprex package (v0.2.0).
回答2:
You can use dplyr
and tidyr
packages transform your data. Let me take an example to demonstrate transformation for 3 statistics(e.g. mean,median,sd
). You can add other statistics like count
, % missing
etc similar way.
# Sample data
df <- data.frame(X = 1:10, Y = 11:20, Z = 101:110)
library(dplyr)
library(tidyr)
df %>% summarize_all(funs(mean,median,sd)) %>% #Statistics for mean, median,sd
gather(key, value) %>% #Convert to long format
separate(key, c("feature", "stat")) %>% #separate out feature from function
spread(stat, value)
# feature mean median sd
# 1 X 5.5 5.5 3.02765
# 2 Y 15.5 15.5 3.02765
# 3 Z 105.5 105.5 3.02765
回答3:
If X, Y and Z are the columns of your dataset, then you could just use the apply
function to each column in your dataset in R and this will print each of your required statistics e.g.:
apply(dat,2,function(x) c(length(which(is.na(x)==TRUE)),summary(x)))
I mean in general you would specify the data-frame in the first argument. Then select the rows/columns or list (e.g. 2 is column, 1 is row for a matrix/df) depending on the type of structure you are using the function on. And finally a function to apply to each of those rows/columns of list as specified in the prior argument. In this case we used the c
function to output both the length of NA and a summary of the data (min, max, quartiles). By extension we can specify any argument here, so in general:
apply("Insert your data-frame","What part of the data-frame",function(x) c(a(x),b(x),...))
It would be preferable to read the R help file for any additional requirements!
Hope it helps; Cheers.
来源:https://stackoverflow.com/questions/50432751/creating-table-of-variables-with-specific-summary-statistics