How do I summarize data that is broken into many columns?

Question

I have a data set that contains answers to a "select as many as apply" question with each possible answer in separate columns. So, assuming our question is "What color shirt is acceptable to you?" it looks something like this:

id    Q3_Red Q3_Blue Q3_Green    Q3_Purple
9                    
8                    Green       Purple
7                    Green     
6     Red               
5                                Purple
4            Blue          
3            Blue                Purple
2     Red    Blue    Green     
1     Red                        Purple
10    Red                        Purple

Which you can make into an actual data frame with:

tmp <- data.frame(
  "id" = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  "Q3_Red" = c("","","","Red","","","","Red","Red","Red"),
  "Q3_Blue" = c("","","","","","Blue","Blue","Blue","",""),
  "Q3_Green" = c("","Green","Green","","","","","Green","",""),
  "Q3_Purple" = c("","Purple","","","Purple","","Purple","","Purple","Purple")
)

I want to summarize it with a count of each answer, e.g.:

Red     4
Blue    3
Green   3
Purple  5

I can get a count of each with something like tmp %>% count(Q3_Red) and organize those into their own data frame but it seems like there must be a way to use a reshape function to do this in one fell swoop. I've looked at gather() and spread() but I can't wrap my head around how I would combine tidyr with count().

Answer 1:

dplyr and tidyr are your friends here:

library(dplyr)
library(tidyr)
tmp %>% 
  pivot_longer(cols = -id, values_to = "response") %>%   # pivot all columns but id
  filter(response != "") %>%        # remove blanks
  group_by(response) %>%            # group by response
  summarize(count = n())            # count the rows in each group

# # A tibble: 4 x 2
#   response count
#   <chr>    <int>
# 1 Blue         3
# 2 Green        3
# 3 Purple       5
# 4 Red          4
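
The group_by() and summarize(count = n()) pair above can probably be collapsed into a single count() call. The following is only a sketch of that shorthand, not part of the original answer; it rebuilds tmp so it runs on its own, and the name = "count" argument is used just to keep the same column name as above.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
tmp <- data.frame(
  id = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red = c("","","","Red","","","","Red","Red","Red"),
  Q3_Blue = c("","","","","","Blue","Blue","Blue","",""),
  Q3_Green = c("","Green","Green","","","","","Green","",""),
  Q3_Purple = c("","Purple","","","Purple","","Purple","","Purple","Purple"),
  stringsAsFactors = FALSE
)

result <- tmp %>%
  pivot_longer(cols = -id, values_to = "response") %>%  # stack all columns but id
  filter(response != "") %>%                            # drop blank answers
  count(response, name = "count")                       # group and count in one step

result
```

Adding sort = TRUE to count() should order the rows by descending count, if that is more useful than alphabetical order.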

Answer 2:

You can use na_if() in dplyr to convert "" to NA and then pivot_longer() in tidyr to stack all columns starting with Q3.

Note: na_if() is used so that values_drop_na = TRUE in pivot_longer() has NA values to drop.

library(dplyr)
library(tidyr)

tmp %>% 
  mutate(across(-id, ~ na_if(.x, ""))) %>%   # turn "" into NA in every column but id
  pivot_longer(-id, values_drop_na = TRUE) %>%
  count(value)

# # A tibble: 4 x 2
#   value      n
#   <chr>  <int>
# 1 Blue       3
# 2 Green      3
# 3 Purple     5
# 4 Red        4

Alternatively, use colSums() and tibble::enframe():

tibble::enframe(colSums(tmp[-1] != ""))

# # A tibble: 4 x 2
#   name      value
#   <chr>     <dbl>
# 1 Q3_Red        4
# 2 Q3_Blue       3
# 3 Q3_Green      3
# 4 Q3_Purple     5

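Answer 3:

In base R, summary() reports the counts per level when the columns are factors. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so convert the columns first:

summary(data.frame(lapply(tmp[-1], factor)))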
# Q3_Red  Q3_Blue   Q3_Green  Q3_Purple
#     :6       :7        :7         :5  
#  Red:4   Blue:3   Green:3   Purple:5
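
If a single table of counts is preferred over the per-column summary, one base R possibility is to flatten the answer columns into one vector and tabulate it. This is a sketch under the assumption that blank strings mark "not selected", as in the example data; the names answers and counts are only illustrative.

```r
# Same example data as in the question
tmp <- data.frame(
  id = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red = c("","","","Red","","","","Red","Red","Red"),
  Q3_Blue = c("","","","","","Blue","Blue","Blue","",""),
  Q3_Green = c("","Green","Green","","","","","Green","",""),
  Q3_Purple = c("","Purple","","","Purple","","Purple","","Purple","Purple"),
  stringsAsFactors = FALSE
)

# Flatten every column except id into one character vector
answers <- unlist(tmp[-1], use.names = FALSE)

# Tabulate the non-blank answers
counts <- table(answers[answers != ""])

counts
```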

Answer 4:

You can try this approach.

First, calculate the frequency per color column:

tmp2 <- colSums(tmp[, 2:5] != "", na.rm = TRUE)

Then convert the result to a data frame, move the row names into a column, and use a regex to strip the question prefix from the names to get the expected result:

library(dplyr)
library(stringr)

tmp2 <- data.frame(tmp2) %>% 
  tibble::rownames_to_column(var = "Colors") %>%                 # row names become a column
  mutate(Colors = str_replace_all(Colors, regex("(^.*_)"), "")) %>%  # drop the "Q3_" prefix
  rename(freq = tmp2)
#   Colors freq
# 1    Red    4
# 2   Blue    3
# 3  Green    3
# 4 Purple    5
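
The same steps can likely be done without dplyr or stringr. The sketch below is a base R variant, not the answerer's code; it assumes every answer column carries the "Q3_" prefix, and the name result is only illustrative.

```r
# Same example data as in the question
tmp <- data.frame(
  id = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red = c("","","","Red","","","","Red","Red","Red"),
  Q3_Blue = c("","","","","","Blue","Blue","Blue","",""),
  Q3_Green = c("","Green","Green","","","","","Green","",""),
  Q3_Purple = c("","Purple","","","Purple","","Purple","","Purple","Purple"),
  stringsAsFactors = FALSE
)

# Count the non-blank cells in each answer column
counts <- colSums(tmp[-1] != "")

# Strip the "Q3_" prefix from the column names and build the summary
result <- data.frame(
  Colors = sub("^Q3_", "", names(counts)),
  freq = unname(counts),
  stringsAsFactors = FALSE
)

result
```

Unlike the pivot-based answers, this keeps the colors in their original column order.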

Source: https://stackoverflow.com/questions/63311409/how-do-i-summarize-data-that-is-broken-into-many-columns

Tags

tidyr