Question
I have a table in Excel with unique columns but many duplicate rows. Duplicates are identified by the column "uniqueID", which is an email address stored as a string. Rows can share the same uniqueID but have missing data in other columns, or different data in the same column.
I want to merge these duplicate rows so that, for each uniqueID, identical values in a column are collapsed into one and differing values are concatenated, so that no data is lost. All data are strings.
I've tried the aggregate function in R and dplyr, but with no success, mostly because I'm still unsure how those two work.
Input:
uniqueID, favFruits, favVeggie, State, favColor
john@mail.com, NULL, carrots, CA, Green
jill@mail.com, apples, NULL, FL, NULL
john@mail.com, grapes, beets, CA, Red
jill@mail.com, cherries, beans, FL, Blue
jill@mail.com, pineapple, beans, FL, Blue
john@mail.com, grapes, beets, CA, Yellow
Output:
uniqueID, favFruits, favVeggie, State, favColor
john@mail.com, grapes, (carrots, beets), CA, (Green, Red, Yellow)
jill@mail.com, (apples, cherries, pineapple), beans, FL, Blue
Note:
"NULL" in this sense is just a blank excel cell. It isn't named NULL or anything. Full dataset has ~30 columns total and ~20000 rows. The "()" in each column is there to signify one cell containing both values, rather than having parenthesis inside the cells.
Answer 1:
I would take Dave2e's answer a step further and remove all the NULLs, like this:
library(tidyverse)
input <- tibble::tribble(
~uniqueID, ~favFruits, ~favVeggie, ~State, ~favColor,
"john@mail.com", "NULL", "carrots", "CA", "Green",
"jill@mail.com", "apples", "NULL", "FL", "NULL",
"john@mail.com", "grapes", "beets", "CA", "Red",
"jill@mail.com", "cherries", "beans", "FL", "Blue",
"jill@mail.com", "pineapple", "beans", "FL", "Blue",
"john@mail.com", "grapes", "beets", "CA", "Yellow"
)
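output <- input %>%
  # turn the literal "NULL" strings into real missing values
  mutate_all(list(~str_replace(., "NULL", NA_character_))) %>%
  group_by(uniqueID) %>%
  # per uniqueID: drop NAs, keep distinct values, join them with ", "
  summarise_all(list(~toString(unique(na.omit(.)))))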
output
# A tibble: 2 x 5
  uniqueID      favFruits                   favVeggie      State favColor
  <chr>         <chr>                       <chr>          <chr> <chr>
1 jill@mail.com apples, cherries, pineapple beans          FL    Blue
2 john@mail.com grapes                      carrots, beets CA    Green, Red, Yellow
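Since the question says the blanks are really empty Excel cells rather than the text "NULL", here is a minimal sketch of the same idea for that case. It assumes dplyr 1.0.0 or later (for across()), and that the blanks arrive in R as empty strings; depending on how the sheet is imported they may already be NA, in which case the na_if() step simply does nothing. The helper name collapse_unique is made up for this example.

```r
library(dplyr)

# Sample data from the question; blank cells are assumed to arrive as "".
input <- data.frame(
  uniqueID  = c("john@mail.com", "jill@mail.com", "john@mail.com",
                "jill@mail.com", "jill@mail.com", "john@mail.com"),
  favFruits = c("", "apples", "grapes", "cherries", "pineapple", "grapes"),
  favVeggie = c("carrots", "", "beets", "beans", "beans", "beets"),
  State     = c("CA", "FL", "CA", "FL", "FL", "CA"),
  favColor  = c("Green", "", "Red", "Blue", "Blue", "Yellow"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop NAs, keep distinct values, join with ", ".
# Returns NA (not "") when a group has no values at all in that column.
collapse_unique <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else toString(x)
}

output <- input %>%
  mutate(across(everything(), ~ na_if(.x, ""))) %>%  # "" becomes NA
  group_by(uniqueID) %>%
  summarise(across(everything(), collapse_unique))   # one row per uniqueID

print(output)
```

Unlike str_replace(), na_if() only replaces cells that are exactly equal to the given value, so a response that merely contains the marker text is left alone.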
Answer 2:
This is a straightforward problem with the dplyr library. The key is to group by uniqueID and use toString to concatenate the unique strings together.
df<-read.table(header=TRUE, text="uniqueID favFruits favVeggie State favColor
john@mail.com NA carrots CA Green
jill@mail.com apples NA FL NA
john@mail.com grapes beets CA Red
jill@mail.com cherries beans FL Blue
jill@mail.com pineapple beans FL Blue
john@mail.com grapes beets CA Yellow")
library(dplyr)
answer <- df %>%
  group_by(uniqueID) %>%
  summarize_all(list(~toString(unique(.))))
print(answer)
# A tibble: 2 x 5
  uniqueID      favFruits                   favVeggie      State favColor
  <fct>         <chr>                       <chr>          <chr> <chr>
1 jill@mail.com apples, cherries, pineapple NA, beans      FL    NA, Blue
2 john@mail.com NA, grapes                  carrots, beets CA    Green, Red, Yellow
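Because the question mentions trying aggregate, here is a rough base R sketch of the same group-and-concatenate idea that also leaves the NAs out of the joined strings. It is only an illustration under the assumption that blanks have been read in as NA; the helper name collapse_unique is made up for this example.

```r
# Sample data from the question, with blank cells represented as NA.
df <- data.frame(
  uniqueID  = c("john@mail.com", "jill@mail.com", "john@mail.com",
                "jill@mail.com", "jill@mail.com", "john@mail.com"),
  favFruits = c(NA, "apples", "grapes", "cherries", "pineapple", "grapes"),
  favVeggie = c("carrots", NA, "beets", "beans", "beans", "beets"),
  State     = c("CA", "FL", "CA", "FL", "FL", "CA"),
  favColor  = c("Green", NA, "Red", "Blue", "Blue", "Yellow"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop NAs, keep distinct values, join with ", ".
collapse_unique <- function(v) {
  toString(unique(v[!is.na(v)]))
}

# Apply the helper to every non-key column, one group per uniqueID.
merged <- aggregate(
  x   = df[setdiff(names(df), "uniqueID")],
  by  = list(uniqueID = df$uniqueID),
  FUN = collapse_unique
)

print(merged)
```

With the data frame method of aggregate used here, NAs in the value columns are passed to FUN rather than dropped beforehand, which is why the helper filters them itself. A group with no values at all in a column ends up as an empty string.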
Source: https://stackoverflow.com/questions/55854499/how-do-i-combine-duplicate-rows-without-losing-unique-data-in-r-or-vba