Consolidating duplicate rows in a dataframe [duplicate]

Question

This is a continuation of a past question I asked. Basically, I have a dataframe, df:

         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F
Pathway6    A         G           NA           NA           E
Pathway2    A         B           H            NA           F
Pathway5    A         B           H            NA           E
Pathway1    A         D           K            NA           F
Pathway7    A         B           C            D            F
Pathway4    A         B           C            D            E
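
For anyone who wants to reproduce this, the data frame above can be rebuilt roughly as follows. This is only a sketch: it assumes the pathway labels are row names and that every column holds character values, neither of which is stated explicitly above.

```r
# Rebuild the example data; pathway labels are assumed to be row names.
df <- data.frame(
  Beginning1 = rep("A", 7),
  Protein2   = c("G", "G", "B", "B", "D", "B", "B"),
  Protein3   = c(NA, NA, "H", "H", "K", "C", "C"),
  Protein4   = c(NA, NA, NA, NA, NA, "D", "D"),
  Biomarker1 = c("F", "E", "F", "E", "F", "F", "E"),
  row.names  = paste0("Pathway", c(3, 6, 2, 5, 1, 7, 4)),
  stringsAsFactors = FALSE
)
print(df)
```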

And now I want to consolidate the rows to look like this:

dfnew 
         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3    A         G           NA           NA           F, E
Pathway2    A         B           H            NA           F, E
Pathway1    A         D           K            NA           F
Pathway7    A         B           C            D            F, E

I've seen a lot of people consolidate identical rows in dataframes using aggregate, but I can't get that function to work on non-numeric values. The closest question I have found, "Combining duplicated rows in R and adding new column containing IDs of duplicates", solved it like this: df1 <- aggregate(df[7], df[-7], unique).

Also, not every pathway has a matching pair, as can be seen with Pathway1.

Thank you so much for your help!

Answer 1:

The following solution, using the dplyr and tidyr packages, should do what you want:

library(dplyr)
library(tidyr)

df %>%
    group_by(Protein2, Protein3, Protein4) %>%
    nest() %>%
    # Collapse each group's Biomarker1 values into one comma-separated string.
    mutate(Biomarker1 = lapply(data, `[[`, 'Biomarker1'),
           Biomarker1 = unlist(lapply(Biomarker1, paste, collapse = ', '))) %>%
    ungroup() %>%
    # Restoring the “Beginning1” column is a bit of work, unfortunately:
    # take the first value found in each group's nested data.
    mutate(Beginning1 = lapply(data, `[[`, 'Beginning1'),
           Beginning1 = unlist(lapply(Beginning1, `[[`, 1))) %>%
    select(-data)

Answer 2:

This is a dplyr solution which should yield the expected result.

library(dplyr)

df <- df %>%
      group_by(Beginning1, Protein2, Protein3, Protein4) %>%
      summarise(Biomarker1 = paste(Biomarker1, collapse = ", "))
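
For completeness, the aggregate approach mentioned in the question can also be made to work in base R. The sketch below is one possible way, not the method from either answer. It assumes the pathway labels are stored in an ordinary column named Pathway (a name chosen here for illustration) and that all columns are character. Because aggregate omits rows whose grouping values are NA, the NA entries are swapped for a placeholder string before grouping and restored afterwards.

```r
# Example data from the question, with the pathway label as an ordinary column.
df <- data.frame(
  Pathway    = paste0("Pathway", c(3, 6, 2, 5, 1, 7, 4)),
  Beginning1 = rep("A", 7),
  Protein2   = c("G", "G", "B", "B", "D", "B", "B"),
  Protein3   = c(NA, NA, "H", "H", "K", "C", "C"),
  Protein4   = c(NA, NA, NA, NA, NA, "D", "D"),
  Biomarker1 = c("F", "E", "F", "E", "F", "F", "E"),
  stringsAsFactors = FALSE
)

key_cols <- c("Beginning1", "Protein2", "Protein3", "Protein4")

# aggregate() omits rows whose grouping values are NA, so swap NA for a
# placeholder string first ("<NA>" is an arbitrary choice).
keys <- df[key_cols]
keys[is.na(keys)] <- "<NA>"

# toString() joins the values of each group with ", ".
res <- aggregate(df[c("Pathway", "Biomarker1")], by = keys, FUN = toString)

# Turn the placeholder back into real NA values.
res[key_cols][res[key_cols] == "<NA>"] <- NA
print(res)
```

Joining the Pathway column as well records which pathways were merged into each row; keeping only the first label per group, as in the desired output, would need a different summary function for that column.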

Source: https://stackoverflow.com/questions/44809293/consolidating-duplicate-rows-in-a-dataframe

Tags

dataframe

aggregate