Question
I have a data set that contains answers to a "select as many as apply" question with each possible answer in separate columns. So, assuming our question is "What color shirt is acceptable to you?" it looks something like this:
id  Q3_Red  Q3_Blue  Q3_Green  Q3_Purple
 9
 8                   Green     Purple
 7                   Green
 6  Red
 5                             Purple
 4          Blue
 3          Blue               Purple
 2  Red     Blue     Green
 1  Red                        Purple
10  Red                        Purple
Which you can make into an actual data frame with:
tmp <- data.frame(
  "id"        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  "Q3_Red"    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  "Q3_Blue"   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  "Q3_Green"  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  "Q3_Purple" = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple")
)
I want to summarize it with a count of each answer, e.g.:
Red 4
Blue 3
Green 3
Purple 5
I can get a count of each with something like tmp %>% count(Q3_Red) and organize those into their own data frame, but it seems like there must be a way to use a reshape function to do this in one fell swoop. I've looked at gather() and spread(), but I can't wrap my head around how I would combine tidyr with count().
Answer 1:
dplyr and tidyr are your friends here:
library(dplyr)
library(tidyr)
tmp %>%
  pivot_longer(cols = -id, values_to = "response") %>% # pivot all columns but id
  filter(response != "") %>% # remove blanks
  group_by(response) %>% # group by response
  summarize(count = n()) # count the rows in each group
# # A tibble: 4 x 2
#   response count
#   <chr>    <int>
# 1 Blue         3
# 2 Green        3
# 3 Purple       5
# 4 Red          4
Answer 2:
You can use na_if() in dplyr to convert "" to NA and then pivot_longer() in tidyr to stack all columns starting with Q3.
Note: The na_if() step is there so that values_drop_na in pivot_longer() has something to drop; it removes NA values, not empty strings.
library(dplyr)
library(tidyr)
tmp %>%
  mutate(across(-id, ~ na_if(.x, ""))) %>% # turn blanks into NA
  pivot_longer(-id, values_drop_na = TRUE) %>% # stack the columns, dropping NA
  count(value)
# # A tibble: 4 x 2
# value n
# <chr> <int>
# 1 Blue 3
# 2 Green 3
# 3 Purple 5
# 4 Red 4
Or use colSums() and tibble::enframe():
tibble::enframe(colSums(tmp[-1] != ""))
# # A tibble: 4 x 2
# name value
# <chr> <dbl>
# 1 Q3_Red 4
# 2 Q3_Blue 3
# 3 Q3_Green 3
# 4 Q3_Purple 5
Answer 3:
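In base R we can use summary(). Note that it only prints per-level counts when the columns are factors, which data.frame() creates from strings by default only before R 4.0.0; in newer versions, build the data frame with stringsAsFactors = TRUE first.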
summary(tmp[-1])
#  Q3_Red    Q3_Blue   Q3_Green   Q3_Purple
#     :6        :7         :7           :5
#  Red:4    Blue:3    Green:3     Purple:5
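If you want the counts keyed by answer value in base R regardless of whether the columns are character or factor, a minimal sketch is to flatten the answer columns, drop the blanks, and tabulate. The data frame is rebuilt here only so the snippet runs on its own, and the names answers and counts are illustrative, not from the original answer.

```r
# Rebuild the example data from the question so the snippet is self-contained
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple")
)

# Flatten every answer column into one character vector
# (as.character() keeps this working if the columns are factors)
answers <- unlist(lapply(tmp[-1], as.character), use.names = FALSE)

# Drop the blanks and tabulate what is left
counts <- table(answers[answers != ""])
counts
```

The result is a table object; wrap it in as.data.frame() if you need a data frame.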
Answer 4:
You can try this approach.
Calculate the frequency per color column:
tmp2 <- colSums(tmp[, 2:5] != "", na.rm = TRUE)
Convert it to a data frame, move the row names into a column, and finally use a regex to strip the unneeded prefix from each name to get the expected result:
library(dplyr)   # %>%, mutate(), rename()
library(stringr) # str_replace_all(), regex()
tmp2 <- data.frame(tmp2) %>%
  tibble::rownames_to_column(var = "Colors") %>%
  mutate(Colors = str_replace_all(Colors, regex("(^.*_)"), "")) %>%
  rename(freq = tmp2)
# Colors freq
# 1 Red 4
# 2 Blue 3
# 3 Green 3
# 4 Purple 5
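The same steps can also be sketched without any packages, which may be handy if dplyr and stringr are not available. This is a variation on the approach above rather than the answerer's code: sub() replaces str_replace_all(), and the names freq and result are illustrative. The data frame is rebuilt so the snippet runs on its own.

```r
# Rebuild the example data from the question so the snippet is self-contained
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple")
)

# Count the non-blank cells in each answer column (named numeric vector)
freq <- colSums(tmp[-1] != "")

# Strip everything up to and including the underscore from the column names,
# then assemble the two-column result
result <- data.frame(
  Colors = sub("^.*_", "", names(freq)),
  freq   = unname(freq)
)
result
```

Because colSums() keeps the column order, the rows come out in the order Red, Blue, Green, Purple rather than alphabetically.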
Source: https://stackoverflow.com/questions/63311409/how-do-i-summarize-data-that-is-broken-into-many-columns