This question already has an answer here:
Hi I need to concatenate strings by groups for multiple columns. I realise that versions of this question has been asked several times (see Aggregating by unique identifier and concatenating related values into a string), but they usually involve concatenating values of a single column.
My dataset is something like:
Sample group Gene1 Gene2 Gene3
A 1 a NA NA
A 2 b NA NA
B 1 NA c NA
C 1 a NA d
C 2 b NA e
C 3 c NA NA
I want to get it into a format where each samples takes only 1 row (the group column is optional):
Sample group Gene1 Gene2 Gene3
A 1,2 a,b NA NA
B 1 NA c NA
C 1,2,3 a,b,c NA d,e
Since the number of genes can go up to the thousands, I can't simply specify the columns that I wish to concatenate.
I know aggregate or dplyr can be used to get the groups but I can't figure out how to do it for multiple columns.
Thanks in advance!
Edit
As my dataset is very large containing thousands of genes, I realised dplyr is too slow. I've been experimenting with data.table and the following code can also get what I want:
setDT(df)[, lapply(.SD, function(x) paste(na.omit(x), collapse = ",")), by = Sample]
The output is now:
Sample group Gene1 Gene2 Gene3
1: A 1,2 a,b
2: B 1 c
3: C 1,2,3 a,b,c d,e
Thanks for all your help!
For these purposes, there are the summarise_all, summarise_at, and summarise_if functions. Using summarise_all:
df %>%
group_by(Sample) %>%
summarise_all(funs(paste(na.omit(.), collapse = ",")))
# A tibble: 3 × 5 Sample group Gene1 Gene2 Gene3 <chr> <chr> <chr> <chr> <chr> 1 A 1,2 a,b 2 B 1 c 3 C 1,2,3 a,b,c d,e
using dplyr, you can try:
dft %>%
group_by(Sample) %>%
summarise_each(funs( toString(unique(.))))
which gives:
# A tibble: 3 × 5
Sample group Gene1 Gene2 Gene3
<chr> <chr> <chr> <chr> <chr>
1 A 1, 2 a, b NA NA
2 B 1 NA c NA
3 C 1, 2, 3 a, b, c NA d, e, NA
Edit:
@Axeman had the right idea with using na.omit(.) to get rid of null values
来源:https://stackoverflow.com/questions/42288757/concatenate-strings-by-group-with-dplyr-for-multiple-columns