Remove duplicates by multiple conditions

Question

I have data where an individual (Name) appears multiple times within an Eggphase category. I would like to keep only one sample per individual, but I don't just want to keep the first one that R finds. I would like to keep the one whose Group appears most often across all the other categories. Hopefully my example makes this clear.

library(tidyverse)
myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

I would like each Name to appear only once per combination of Tissue, Food and Eggphase, BUT I want to keep the row whose Group appears in most, if not all, of the different eggphases (within the same Tissue and Food combination).

  # Results I want
  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     c
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     c
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

I tried

one_bird <- myDF %>% 
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)

but it only keeps the first entry of each combination:

  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     a
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     b
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

Any ideas on how to tell it to select the row where Group appears in most (if not all) of the eggphases within a Tissue and Food combination? In my example, the groups that appear most within the Tissue and Food combination of wb and fl are c and b, but Kia doesn't appear in group b there, so c is the better option. As in this example, my data has duplicates that come from groups other than the most common Group; how do I make it choose the next most common one just for that row?

I hope I have made enough sense.

Answer 1:

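One option would be to create a frequency column 'n' grouped by 'Tissue', 'Food' and 'Group', then arrange in descending order of 'n' within each 'Tissue', 'Food', 'Eggphase', 'Name' combination and use distinct, which keeps the first row of each combination: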

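library(dplyr)
myDF %>%
     # count how many rows each Group has within a Tissue/Food combination
     group_by(Tissue, Food, Group) %>%
     mutate(n = n()) %>%
     ungroup() %>%
     # put the most frequent Group first within each combination
     arrange(Tissue, Food, Eggphase, Name, desc(n)) %>%
     # keep the first row per combination, then drop the helper column
     distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
     select(-n)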
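
The question asks for the Group that appears in the most eggphases, whereas the code above counts rows. As a minimal sketch of that variant (not part of the original answer; the helper column n_phases and the object result are names made up for illustration), the count can be swapped for n_distinct(Eggphase):

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

result <- myDF %>%
  # n_phases (hypothetical name): number of distinct eggphases in which
  # each Group occurs within a Tissue/Food combination
  group_by(Tissue, Food, Group) %>%
  mutate(n_phases = n_distinct(Eggphase)) %>%
  ungroup() %>%
  # Group covering the most eggphases first within each combination
  arrange(Tissue, Food, Eggphase, Name, desc(n_phases)) %>%
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-n_phases)
```

On the sample data this picks group c for both Kia and Lucy in the after phase. For Betty, groups a and b each cover two eggphases within wb/fl, so the count alone does not decide between them and an extra tie-breaking column in arrange() would be needed to force a particular choice.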

Answer 2:

I guess this post and answer should give me a reason to learn dplyr and the tidyverse, but since I've put in the effort to write an answer that works, here it is:

myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

# I usually have the following setting active: options(stringsAsFactors=F)
# The following might error without such a setting

# Create a var that indicates a duplicate or a record with a duplicate
myDF$duplicate <- duplicated(myDF[, c('Name', 'Eggphase', 'Tissue', 'Food')])
myDF$duplicate <- ifelse(duplicated(myDF[, c('Name', 'Eggphase', 'Tissue', 'Food')], fromLast = TRUE),
                         yes = TRUE, no = myDF$duplicate)

# Count the distinct eggphases in which each group occurs
eggphaseCount <- with(myDF, aggregate(x = list(Group_phaseCt = Eggphase),
                                      by = list(Group = Group),
                                      FUN = function(x) length(unique(x))))
# Merge to DF
myDF <- merge(myDF, eggphaseCount, by = 'Group', all = TRUE)

# Get the max # of eggphases by name
scale <- with(myDF, aggregate(x = list(PhaseMax = Group_phaseCt),
                              by = list(Name = Name),
                              FUN = max))
# Add to DF
myDF <- merge(myDF, scale, by = 'Name', all = TRUE)

# Take the ratio
myDF$bestRatio <- with(myDF, Group_phaseCt / PhaseMax)
# Keep only those that aren't a duplicate, or are a duplicate and have the highest ratio
myDF2 <- myDF[with(myDF, which(duplicate == FALSE | (duplicate == TRUE & bestRatio == 1))), ]

Answer 3:

Hey, thanks for your help, guys! A combination of what you suggested seems to have worked:

# Create a var that indicates a duplicate or a record with a duplicate
myDF$duplicate <- duplicated(myDF[, c('Name', 'Eggphase', 'Tissue', 'Food')])
# This alone won't flag the first entry of a duplicated combination,
# so also scan from the last row backwards
myDF$duplicate <- ifelse(duplicated(myDF[, c('Name', 'Eggphase', 'Tissue', 'Food')], fromLast = TRUE),
                         yes = TRUE, no = myDF$duplicate)

# Count the distinct eggphases in which each group occurs
eggphaseCount <- with(myDF, aggregate(x = list(Group_phaseCt = Eggphase),
                                      by = list(Group = Group),
                                      FUN = function(x) length(unique(x))))
# Merge to DF
myDF <- merge(myDF, eggphaseCount, by = 'Group', all = TRUE)

# Get the max # of eggphases by name
scale <- with(myDF, aggregate(x = list(PhaseMax = Group_phaseCt),
                              by = list(Name = Name),
                              FUN = max))
# Add to DF
myDF <- merge(myDF, scale, by = 'Name', all = TRUE)

# Take the ratio
myDF$bestRatio <- with(myDF, Group_phaseCt / PhaseMax)

# Make a new df without duplicates
myDF2 <- myDF %>% 
  # Arrange so that, within each Tissue/Food/Eggphase/Name combination, the first
  # row comes from the group with the most eggphases. Group goes last, as a
  # tie-breaker only: if it is sorted before desc(Group_phaseCt), the count
  # never affects the order
  arrange(Tissue, Food, Eggphase, Name, desc(Group_phaseCt), desc(PhaseMax), Group) %>% 
  # Keep only the first row per combination, retaining all other columns
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)
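
Whichever approach is used, it may be worth confirming that the result really has one row per Tissue, Food, Eggphase and Name combination, because a filter on a ratio or count can let tied rows through. A minimal base R sketch, where key_cols and count_extra_rows are hypothetical names introduced here rather than anything from the answers:

```r
# Columns that should identify exactly one row in the final result
key_cols <- c("Tissue", "Food", "Eggphase", "Name")

# Hypothetical helper: how many rows repeat an earlier key combination
count_extra_rows <- function(df, keys = key_cols) {
  sum(duplicated(df[, keys]))
}

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

# The raw sample data has three surplus rows (one each for Kia, Lucy, Betty)
extra_before <- count_extra_rows(myDF)

# Any correctly de-duplicated result should report zero; here a plain
# first-row de-duplication stands in for the result being checked
deduped <- myDF[!duplicated(myDF[, key_cols]), ]
extra_after <- count_extra_rows(deduped)
```

Calling count_extra_rows(myDF2) on the output of any of the answers should likewise return 0 if every combination was reduced to a single row.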

Source: https://stackoverflow.com/questions/47267725/remove-duplicates-by-multiple-conditions

Tags

dplyr

tidyr

tidyverse