Tabulate responses for multiple columns by grouping variable with dplyr

跟風遠走 提交于 2019-12-04 14:44:16

You can get counts by group for each of var1, var2, and var3 if you "melt" your data frame into long form first, which will "stack" the three var columns into a single column (value) and then create an additional column (variable) marking which rows go with which var.

library(dplyr)
library(reshape2)

#sample data
group <- sample(c('A', 'B', 'C'), 100, replace=TRUE)
var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))

df<-data.frame(group, var1, var2, var3)

out_df <- df %>% 
  melt(id.var="group") %>%
  filter(!is.na(value)) %>%  # Remove NA
  group_by(group, variable, value) %>%
  summarise(count=n()) %>% 
  group_by(group, variable) %>% 
  mutate(percent=count/sum(count))  

You can stop the function chain at any point to look at the intermediate steps, which will help in understanding what each step is doing.

Because we grouped by group, variable, and value, we end up with count giving us the number of rows for combination of those three columns. Then we group only by group and variable to calculate the percentage of rows that each value of count contributes to each combination of the two grouping variables. (The second group_by is not essential, because dplyr drops the last grouping variable after a summarise operation (because there will only be one row for each combination of all the original grouping variables) but I prefer to regroup explicitly.)

Here's the final result:

out_df

   group variable value count    percent
1      A     var1     1     6 0.26086957
2      A     var1     2     3 0.13043478
3      A     var1     3     6 0.26086957
4      A     var1     4     1 0.04347826
5      A     var1     5     7 0.30434783
...
41     C     var3     1     6 0.25000000
42     C     var3     2     5 0.20833333
43     C     var3     3     4 0.16666667
44     C     var3     4     2 0.08333333
45     C     var3     5     7 0.29166667

If, by tabulate you mean sum then try this:

out_df<-df %>%group_by(group)
summarise(out_df,var1=sum(var1,na.rm=TRUE),var2=sum(var2,na.rm=TRUE),var3=sum(var3,na.rm=TRUE))

which yields something like this:

      group var1 var2 var3
  1     A   67   72   60
  2     B   88   92   97
  3     C   54   77   48
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!