Remove duplicates based on specific criteria

强颜欢笑 submitted on 2019-12-19 05:09:03

Question


I have a dataset that looks something like this:

df <- structure(list(Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L, 
100L, 200L, 300L), Amount = c(NA, 1000L, NA, 564L, 0L, 200L, 
NA, 0L, NA), Company = structure(c(NA, 1L, NA, 4L, 2L, 3L, NA, 
3L, NA), .Label = c("ATT", "Boeing", "Petco", "T Mobile"), class = "factor")), .Names =     
c("Claim.Num", "Amount", "Company"), class = "data.frame", row.names = c(NA, 
-9L))

I want to remove rows with duplicated Claim.Num values, and within each set of duplicates the rows to drop are those where Company is NA or Amount is 0.

In other words, remove records 1, 3, and 5.

I've gotten this far: df <- df[!duplicated(df$Claim.Num[which(df$Amount == 0 | df$Company == 'NA')]),]

The code runs without errors but doesn't remove duplicate rows according to the required criteria. I think that's because I'm telling it to remove duplicate Claim.Nums among the rows that match those criteria, rather than to remove duplicate Claim.Nums in general while preferring rows with certain Amounts and Companies for removal. Please note that I can't simply filter the dataset on the specified values, because other records with 0 or NA values must be kept (e.g. records 8 and 9 shouldn't be excluded because their Claim.Nums are not duplicated).
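For illustration, here is a step-by-step sketch of what an attempt like the one above evaluates to, assuming the comparison is written with == and that the sample data is rebuilt with data.frame() (the names cond, idx, flags and attempt are only illustrative). It suggests two reasons nothing is removed: comparing against the string "NA" yields NA rather than TRUE for missing values, and the logical index ends up with length 2, which R recycles across all 9 rows.

```r
# Sample data rebuilt from the question (same values as the structure() call).
df <- data.frame(
  Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L, 100L, 200L, 300L),
  Amount    = c(NA, 1000L, NA, 564L, 0L, 200L, NA, 0L, NA),
  Company   = factor(c(NA, "ATT", NA, "T Mobile", "Boeing", "Petco", NA, "Petco", NA))
)

# Comparing a missing value with the string "NA" gives NA, not TRUE,
# so only the rows with Amount == 0 come out as TRUE here.
cond <- df$Amount == 0 | df$Company == "NA"

# which() silently drops the NA entries and keeps positions 5 and 8.
idx <- which(cond)

# Claim.Num at those positions is 700 and 200: no duplicates among them.
flags <- !duplicated(df$Claim.Num[idx])

# A length-2 logical index of TRUE, TRUE is recycled, so every row is kept.
attempt <- df[flags, ]
```

Testing for missing values with is.na() and building an index that has one entry per row of df avoids both problems.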


Answer 1:


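If you order your data frame first so that the rows you would rather keep come first, then duplicated() will keep the ones you want, because it only flags the second and later occurrences of each Claim.Num: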

df.tmp <- with(df, df[order(ifelse(is.na(Company) | Amount == 0, 1, 0)), ])
df.tmp[!duplicated(df.tmp$Claim.Num), ]
#   Claim.Num Amount  Company
# 2       500   1000      ATT
# 4       600    564 T Mobile
# 6       700    200    Petco
# 7       100     NA     <NA>
# 8       200      0    Petco
# 9       300     NA     <NA>
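The same ordering idea can be sketched without a temporary data frame, and with the surviving rows returned in their original order. This is only a variation under stated assumptions: the sample data is rebuilt with data.frame(), the names low_priority, ord and keep are illustrative, and the Amount test is wrapped in !is.na() so the priority vector never contains NA.

```r
# Sample data rebuilt from the question.
df <- data.frame(
  Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L, 100L, 200L, 300L),
  Amount    = c(NA, 1000L, NA, 564L, 0L, 200L, NA, 0L, NA),
  Company   = factor(c(NA, "ATT", NA, "T Mobile", "Boeing", "Petco", NA, "Petco", NA))
)

# TRUE marks rows that should lose against a duplicate with real values.
low_priority <- is.na(df$Company) | (!is.na(df$Amount) & df$Amount == 0)

# FALSE sorts before TRUE, and order() leaves ties in their original order.
ord <- order(low_priority)

# Keep the first occurrence of each Claim.Num in that ordering,
# expressed as row positions in the original data frame.
keep <- ord[!duplicated(df$Claim.Num[ord])]

# sort() restores the original row order of the survivors.
result <- df[sort(keep), ]
```

For the sample data, result contains rows 2, 4, 6, 7, 8 and 9, matching the output above.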



Answer 2:


A slightly different approach:

r <- merge(df,
           aggregate(df$Amount,by=list(Claim.Num=df$Claim.Num),length),
           by="Claim.Num")
result <-r[!(r$x>1 & (is.na(r$Company) | (r$Amount==0))),-ncol(r)]
result
#   Claim.Num Amount  Company
# 1       100     NA     <NA>
# 2       200      0    Petco
# 3       300     NA     <NA>
# 5       500   1000      ATT
# 7       600    564 T Mobile
# 9       700    200    Petco

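This adds a column x holding the number of rows that share each Claim.Num, then drops the rows that belong to a duplicated Claim.Num (x > 1) and match your criteria. The -ncol(r) index just removes the helper column x at the end. Note that merge() sorts the result by Claim.Num, which is why the row order differs from the original data.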
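If the original row order matters, the per-claim count can also be computed in place with ave() rather than with merge() and aggregate(). The following is a minimal sketch of that alternative, not the answer's own method; the sample data is rebuilt with data.frame(), and n_claim and removable are illustrative names.

```r
# Sample data rebuilt from the question.
df <- data.frame(
  Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L, 100L, 200L, 300L),
  Amount    = c(NA, 1000L, NA, 564L, 0L, 200L, NA, 0L, NA),
  Company   = factor(c(NA, "ATT", NA, "T Mobile", "Boeing", "Petco", NA, "Petco", NA))
)

# For every row, the number of rows sharing its Claim.Num.
n_claim <- ave(seq_along(df$Claim.Num), df$Claim.Num, FUN = length)

# A row is removable only if its claim is duplicated AND it matches the
# criteria; !is.na() keeps a missing Amount from producing NA in the index.
removable <- n_claim > 1 &
  (is.na(df$Company) | (!is.na(df$Amount) & df$Amount == 0))

result <- df[!removable, ]
```

Because no merge is involved, the rows stay in their original order and no helper column has to be dropped afterwards.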




Answer 3:


Another way based on subset and logical indices:

# Keep a row if its Claim.Num is unique, or if its Amount is present and non-zero.
subset(df, !(duplicated(Claim.Num) | duplicated(Claim.Num, fromLast = TRUE)) |
         (!is.na(Amount) & Amount != 0))
#   Claim.Num Amount  Company
# 2       500   1000      ATT
# 4       600    564 T Mobile
# 6       700    200    Petco
# 7       100     NA     <NA>
# 8       200      0    Petco
# 9       300     NA     <NA>
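The combination of duplicated() with and without fromLast = TRUE is what marks every member of a duplicated group rather than only the later copies. A small sketch on the Claim.Num values alone (claim and in_dup_group are illustrative names) shows the pieces:

```r
claim <- c(500L, 500L, 600L, 600L, 700L, 700L, 100L, 200L, 300L)

# Scanning forward flags only the second copy of each repeated value.
later_copy <- duplicated(claim)

# Scanning from the end flags only the first copy.
earlier_copy <- duplicated(claim, fromLast = TRUE)

# Together they flag every value that occurs more than once.
in_dup_group <- later_copy | earlier_copy
```

Negating in_dup_group gives the rows whose Claim.Num is unique, which is the first half of the subset() condition above.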


Source: https://stackoverflow.com/questions/21788378/remove-duplicates-based-on-specific-criteria
