Collapse / concatenate / aggregate multiple columns to a single comma separated string within each group

╄→гoц情女王★ submitted on 2021-02-13 17:04:11


This is an extension of the post "Collapse / concatenate / aggregate a column to a single comma separated string within each group".

Goal: aggregate multiple columns by one grouping variable and separate the individual values with a separator of choice.

Reproducible example:

data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = c(rep(c(100), 3), rep(200,3)), C = rep(c(1,2,NA),2), D = c(15:20), E = rep(c(1,NA,NA),2))
    A   B  C  D  E
1 111 100  1 15  1
2 111 100  2 16 NA
3 111 100 NA 17 NA
4 222 200  1 18  1
5 222 200  2 19 NA
6 222 200 NA 20 NA

A is the grouping variable. B should still appear in the result (B depends on A in my application), and C, D and E are the variables to be collapsed into separated character strings.

Desired Output

    A   B  C    D         E
1 111 100  1,2  15,16,17  1
2 222 200  1,2  18,19,20  1

I don't have a ton of experience with R. I tried to expand on the solutions G. Grothendieck posted in the linked question to meet my requirements, but I can't quite get them right for multiple columns.

What would be a proper implementation to get the desired output?

I focused specifically on group_by and summarise_all and aggregate in my attempts. They are a complete mess so I don't believe it would even be helpful to display.

EDIT: The solutions posted work great for producing the desired result! To keep improving the value of this post for those who find it:

How can users select their own separator, e.g. '-' or '\n'? The current solutions by @akrun and @tmfmnk both seem to result in lists instead of a concatenated character string. Please correct me if I have described this incorrectly.

> data$A
[1] 111 111 111 222 222 222
> data$B
[1] 100 100 100 200 200 200
> data$C
[1]  1  2 NA  1  2 NA
> data$D
[1] 15 16 17 18 19 20
> data$E
[1]  1 NA NA  1 NA NA


We can group by 'A', 'B', and use summarise_at to paste all the non-NA elements

library(dplyr)

data %>% 
    group_by(A, B) %>%
    summarise_at(vars(-group_cols()), ~ toString(.[!is.na(.)]))
# A tibble: 2 x 5
# Groups:   A [2]
#      A     B C     D          E    
#  <dbl> <dbl> <chr> <chr>      <chr>
#1   111   100 1, 2  15, 16, 17 1    
#2   222   200 1, 2  18, 19, 20 1   

If we need a custom delimiter, use paste or str_c (from stringr) with the collapse argument

library(stringr)  # provides str_c()

data %>% 
    group_by(A, B) %>%
    summarise_at(vars(-group_cols()), ~ str_c(.[!is.na(.)], collapse="_"))

Or using base R with aggregate

aggregate(. ~ A + B, data, FUN = function(x) 
      toString(x[!is.na(x)]), na.action = NULL)
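
To let the caller pick the separator in base R, the aggregate call can be wrapped in a small function. This is only a sketch: collapse_groups and its sep argument are names made up for illustration, and the grouping columns A and B are hard-coded to match the example data.

```r
# Example data from the question
data <- data.frame(A = c(rep(111, 3), rep(222, 3)),
                   B = c(rep(100, 3), rep(200, 3)),
                   C = rep(c(1, 2, NA), 2),
                   D = 15:20,
                   E = rep(c(1, NA, NA), 2))

# Hypothetical helper (not part of the original answers): collapse every
# non-grouping column into one string per group, joined by `sep`.
collapse_groups <- function(df, sep = ",") {
  aggregate(. ~ A + B, df,
            FUN = function(x) paste(x[!is.na(x)], collapse = sep),
            na.action = NULL)  # keep rows with NA so FUN can drop NAs per column
}

res <- collapse_groups(data, sep = "-")
print(res)

# Inspect the column types: the collapsed columns are strings, not lists
str(res)
```

With sep = "\n" the values end up on separate lines once a string is written out with cat() or writeLines(); print() shows the escape sequence instead.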


With dplyr, you can do:

data %>%
 group_by(A, B) %>%
 summarise_all(~ toString(na.omit(.)))

      A     B C     D          E    
  <dbl> <dbl> <chr> <chr>      <chr>
1   111   100 1, 2  15, 16, 17 1    
2   222   200 1, 2  18, 19, 20 1 
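
In dplyr 1.0.0 and later, summarise_all() and summarise_at() are superseded by across(). A rough equivalent with a user-chosen separator might look like the sketch below; it swaps toString() for paste() with collapse, and the sep variable is just an illustrative name, not something from the answers.

```r
library(dplyr)

# Example data from the question
data <- data.frame(A = c(rep(111, 3), rep(222, 3)),
                   B = c(rep(100, 3), rep(200, 3)),
                   C = rep(c(1, 2, NA), 2),
                   D = 15:20,
                   E = rep(c(1, NA, NA), 2))

sep <- " | "  # illustrative: any separator string, e.g. "-" or "\n"

res <- data %>%
  group_by(A, B) %>%
  # everything() skips the grouping columns inside a grouped summarise()
  summarise(across(everything(), ~ paste(na.omit(.x), collapse = sep)),
            .groups = "drop")

print(res)

# Each collapsed column is an ordinary character vector
stopifnot(is.character(res$C), is.character(res$D), is.character(res$E))
```

The .groups = "drop" argument returns an ungrouped tibble, which avoids the regrouping message that summarise() otherwise prints.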

