How do I combine duplicate rows without losing unique data in R or VBA?

Question

I have a table in Excel with distinct columns but many duplicate rows. Duplicates are identified by the column "uniqueID", which is an email address stored as a string. Rows can share the same uniqueID but have missing data in other columns, or different data in the same column.

I want to merge these duplicate rows so that, for each uniqueID, identical values in a column are kept once and differing values are concatenated into one cell, so that no data is lost. All data are strings.

I've tried the aggregate function in R and dplyr, but with no success, mostly because I'm still unsure how those two work.

Input:

uniqueID, favFruits, favVeggie, State, favColor
john@mail.com, NULL, carrots, CA, Green
jill@mail.com, apples, NULL, FL, NULL
john@mail.com, grapes, beets, CA, Red
jill@mail.com, cherries, beans, FL, Blue
jill@mail.com, pineapple, beans, FL, Blue 
john@mail.com, grapes, beets, CA, Yellow

Output:

uniqueID, favFruits, favVeggie, State, favColor
john@mail.com, grapes, (carrots, beets), CA, (Green, Red, Yellow)
jill@mail.com, (apples, cherries, pineapple), beans, FL, Blue

Note:

"NULL" here is just a blank Excel cell; it isn't the literal text NULL. The full dataset has ~30 columns and ~20000 rows. The "()" in the output marks one cell containing several values; the parentheses themselves should not appear in the cells.

Answer 1:

I would take Dave2e's answer a step further and remove all the NULLs, like this:

library(tidyverse)

input <- tibble::tribble(
          ~uniqueID,  ~favFruits, ~favVeggie, ~State, ~favColor,
    "john@mail.com",      "NULL",  "carrots",   "CA",   "Green",
    "jill@mail.com",    "apples",     "NULL",   "FL",    "NULL",
    "john@mail.com",    "grapes",    "beets",   "CA",     "Red",
    "jill@mail.com",  "cherries",    "beans",   "FL",    "Blue",
    "jill@mail.com", "pineapple",    "beans",   "FL",    "Blue",
    "john@mail.com",    "grapes",    "beets",   "CA",  "Yellow"
    )


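output <- input %>%
    # Turn the literal "NULL" placeholder into a real NA in every column
    mutate_all(list(~str_replace(., "NULL", NA_character_))) %>%
    group_by(uniqueID) %>%
    # Per group: drop NAs, keep distinct values, join them with ", "
    summarise_all(list(~toString(unique(na.omit(.)))))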

output

# A tibble: 2 x 5
  uniqueID      favFruits                   favVeggie      State favColor          
  <chr>         <chr>                       <chr>          <chr> <chr>             
1 jill@mail.com apples, cherries, pineapple beans          FL    Blue              
2 john@mail.com grapes                      carrots, beets CA    Green, Red, Yellow
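
Since the question mentions aggregate, here is a rough base R sketch of the same collapse without dplyr. It assumes the blanks have already been read in as NA; collapse_unique and output_base are names made up for this illustration, not part of any package.

```r
# Same toy data as in the question, with blank cells represented as NA.
input <- data.frame(
  uniqueID  = c("john@mail.com", "jill@mail.com", "john@mail.com",
                "jill@mail.com", "jill@mail.com", "john@mail.com"),
  favFruits = c(NA, "apples", "grapes", "cherries", "pineapple", "grapes"),
  favVeggie = c("carrots", NA, "beets", "beans", "beans", "beets"),
  State     = c("CA", "FL", "CA", "FL", "FL", "CA"),
  favColor  = c("Green", NA, "Red", "Blue", "Blue", "Yellow"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop NAs, keep distinct values, join with ", ".
collapse_unique <- function(x) toString(unique(na.omit(x)))

# Use the data-frame method of aggregate() rather than the formula
# interface, because the formula interface defaults to na.action = na.omit
# and would drop every row that has an NA in any column.
output_base <- aggregate(
  input[setdiff(names(input), "uniqueID")],  # all columns except the key
  by  = input["uniqueID"],                   # group by the key column
  FUN = collapse_unique
)

print(output_base)
```

This should give the same two rows as the dplyr version above, one per uniqueID.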

Answer 2:

This is a straightforward problem with the dplyr library. The key is to group by uniqueID and use toString to concatenate the unique strings.

df<-read.table(header=TRUE, text="uniqueID favFruits favVeggie State favColor
john@mail.com NA carrots CA Green
jill@mail.com apples NA FL NA
john@mail.com grapes beets CA Red
jill@mail.com cherries beans FL Blue
jill@mail.com pineapple beans FL Blue 
john@mail.com grapes beets CA Yellow")


library(dplyr)
answer <- df %>%
    group_by(uniqueID) %>%
    summarize_all(list(~toString(unique(.))))

print(answer)
# A tibble: 2 x 5
  uniqueID      favFruits                   favVeggie      State favColor          
  <fct>         <chr>                       <chr>          <chr> <chr>             
1 jill@mail.com apples, cherries, pineapple NA, beans      FL    NA, Blue          
2 john@mail.com NA, grapes                  carrots, beets CA    Green, Red, Yellow
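
The NA entries in this output come from the missing cells. Depending on how the sheet is imported, blanks may also arrive as empty strings rather than NA. As a hedged sketch, assuming dplyr 1.0.0 or later (where across() supersedes the *_all verbs), both cases can be handled in one pipeline. The names raw and answer_clean are made up for this illustration.

```r
library(dplyr)

# Toy data where blank cells arrived as empty strings ("") instead of NA.
raw <- data.frame(
  uniqueID  = c("john@mail.com", "jill@mail.com", "john@mail.com",
                "jill@mail.com", "jill@mail.com", "john@mail.com"),
  favFruits = c("", "apples", "grapes", "cherries", "pineapple", "grapes"),
  favVeggie = c("carrots", "", "beets", "beans", "beans", "beets"),
  State     = c("CA", "FL", "CA", "FL", "FL", "CA"),
  favColor  = c("Green", "", "Red", "Blue", "Blue", "Yellow"),
  stringsAsFactors = FALSE
)

answer_clean <- raw %>%
  # Turn empty strings into NA in every column.
  mutate(across(everything(), ~ na_if(.x, ""))) %>%
  group_by(uniqueID) %>%
  # Inside summarise(), everything() covers only the non-grouping columns.
  # Per group: drop NAs, keep distinct values, join them with ", ".
  summarise(across(everything(), ~ toString(unique(na.omit(.x)))))

print(answer_clean)
```

Because the column selection is everything(), the same pipeline should carry over to the full ~30-column table without listing column names.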

Source: https://stackoverflow.com/questions/55854499/how-do-i-combine-duplicate-rows-without-losing-unique-data-in-r-or-vba

Tags

excel

vba