Question
I have a data frame in which the ID column contains repeated values, alongside a site count column. How can I remove the duplicate ID records only when the Site_count value is greater than 0?
Generate DF:
DF <- data.frame(
  ID = sample(100:300, 100, replace = TRUE),
  Site_count = sample(0:1, 100, replace = TRUE)
)
My attempt at the subset:
subset(DF[!duplicated(DF$ID), ], Site_count > 0)
But this removes every record with a site count of 0. I want to remove a record only when there is a duplicate record for the same ID with a site count greater than 0.
The desired result would look something like this (note that there are IDs with a site count of 0, but no ID appears both with a 0 and with another site count):
ID  Site_count
--  ----------
1   0
2   1
3   1
4   0
5   5
Answer 1:
The expected output is not very clear. Maybe this helps:
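# TRUE for every row whose ID has more than one zero-count record
indx <- with(DF, ave(!Site_count, ID, FUN = function(x) sum(x) > 1))
# Drop the repeated rows (all but the first) of those IDs
DF[!(duplicated(DF$ID) & indx), ]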
Update
After re-reading the description, your expected answer could also be:
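# 1 for every row whose ID has at least one record with a positive count
indx <- with(DF, ave(Site_count, ID, FUN = function(x) any(x > 0)))
# Drop the repeated rows (all but the first) of those IDs
DF[!(duplicated(DF$ID) & indx), ]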
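To make the difference between the two indexes concrete, here is a small sketch on a hand-built data frame. The name `toy` and its values are illustrative assumptions, not data from the question: ID 1 has a zero and a positive count, ID 2 has two zero counts, and ID 3 has a single positive count.

```r
# Hypothetical toy data (not from the original post)
toy <- data.frame(ID = c(1, 1, 2, 2, 3),
                  Site_count = c(0, 4, 0, 0, 2))

# First variant: flag IDs that have more than one zero-count row
indx1 <- with(toy, ave(!Site_count, ID, FUN = function(x) sum(x) > 1))
res1 <- toy[!(duplicated(toy$ID) & indx1), ]

# Second variant: flag IDs that have at least one positive count
indx2 <- with(toy, ave(Site_count, ID, FUN = function(x) any(x > 0)))
res2 <- toy[!(duplicated(toy$ID) & indx2), ]

print(res1)  # ID 2 collapses to one row; both rows of ID 1 remain
print(res2)  # ID 1 collapses to its first row; both rows of ID 2 remain
```

In both variants `duplicated()` marks the second and later occurrences of an ID, so the row that survives is always the first one in the data, whatever its count. On this toy data the second variant therefore keeps the zero-count row of ID 1, which may or may not be what is wanted.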
Answer 2:
Possibly this:
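# The output below is from the original post; R >= 3.6.0 changed the default
# sampling algorithm of sample(), so the same seed may give different values.
set.seed(42)
DF <- data.frame(
  ID = c(sample(1:3, 10, replace = TRUE), 4),
  Site_count = c(sample(0:3, 10, replace = TRUE), 0)
)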
#    ID Site_count
# 1   3          1
# 2   3          2
# 3   1          3
# 4   3          1
# 5   2          1
# 6   2          3
# 7   3          3
# 8   1          0
# 9   2          1
# 10  3          2
# 11  4          0
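# Return the first positive count in a group; a single-row group is returned as-is
fun <- function(x) {
  if (length(x) == 1L) return(x)
  x[which.max(x > 0)]  # which.max() on a logical gives the index of the first TRUE
}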
library(plyr)
ddply(DF, .(ID), summarise, Site_count = fun(Site_count))
#   ID Site_count
# 1  1          3
# 2  2          1
# 3  3          1
# 4  4          0
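If loading plyr is not desirable, the same per-ID summary can probably be done in base R with `aggregate()`. The following is a minimal sketch, not part of the original answer. It types in the ID and Site_count values from the listing above by hand, so that it does not depend on the random number generator, and uses a shortened version of `fun` that relies on `which.max()` returning 1 when no element is positive.

```r
# Values copied from the data frame printed in the answer above
DF <- data.frame(
  ID         = c(3, 3, 1, 3, 2, 2, 3, 1, 2, 3, 4),
  Site_count = c(1, 2, 3, 1, 1, 3, 3, 0, 1, 2, 0)
)

# First positive count in the group, or the first element if none is positive
fun <- function(x) x[which.max(x > 0)]

# One row per ID, using the formula interface of aggregate()
res <- aggregate(Site_count ~ ID, data = DF, FUN = fun)
print(res)
```

Like the plyr version, this returns one row per ID and keeps only the ID and Site_count columns. Any other columns in the data frame would need to be carried along separately.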
Source: https://stackoverflow.com/questions/25682428/r-subset-column-based-on-condition-on-duplicate-rows