I have a data frame in R containing the columns ID.A, ID.B and DISTANCE, where DISTANCE represents the distance between ID.A and ID.B. For each value (1 to n) of ID.A there may be several rows, each with a different ID.B and DISTANCE (i.e. ID.A is duplicated across rows, e.g. several rows all with ID.A = 4, each carrying its own ID.B and distance).
I would like to remove the rows where ID.A is duplicated, conditional on the distance value, so that I am left with the smallest DISTANCE for each ID.A.
Hopefully that makes sense?
Many thanks in advance
EDIT
Hopefully an example will prove more useful than my text. Here I would like to remove the second and third rows where ID.A = 3, keeping the row with DISTANCE 0.4:
myDF <- read.table(text="ID.A ID.B DISTANCE
1 3 1
2 6 8
3 2 0.4
3 3 1
3 8 5
4 8 7
5 2 11", header = TRUE)
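In other words, the result should keep only the smallest-DISTANCE row for each value of ID.A:
ID.A ID.B DISTANCE
   1    3      1.0
   2    6      8.0
   3    2      0.4
   4    8      7.0
   5    2     11.0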
You can also do it easily in base R. If dat is your data frame:
# For each ID.A group, order by DISTANCE and keep the first (smallest-distance) row
do.call(rbind,
        by(dat, INDICES = list(dat$ID.A),
           FUN = function(x) head(x[order(x$DISTANCE), ], 1)))
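For example, applied to the question's myDF (simply substituting myDF for dat):

res <- do.call(rbind,
               by(myDF, INDICES = list(myDF$ID.A),
                  FUN = function(x) head(x[order(x$DISTANCE), ], 1)))
res  # one row per ID.A, keeping the smallest DISTANCE (0.4 for ID.A = 3)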
One possibility:
# Sort so that, within each ID.A, the smallest DISTANCE comes first,
# then keep only the first row of each ID.A
myDF <- myDF[order(myDF$ID.A, myDF$DISTANCE), ]
newdata <- myDF[!duplicated(myDF$ID.A), ]
Which gives:
  ID.A ID.B DISTANCE
1    1    3      1.0
2    2    6      8.0
3    3    2      0.4
6    4    8      7.0
7    5    2     11.0
You can use the plyr package for that. For example, if your data look like this:
d <- data.frame(id.a = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                id.b = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
                dist = c(12, 10, 15, 20, 18, 16, 17, 25, 9))
  id.a id.b dist
1    1    1   12
2    1    2   10
3    1    3   15
4    2    1   20
5    2    2   18
6    3    1   16
7    3    2   17
8    3    3   25
9    3    4    9
You can use the ddply function like this:
library(plyr)
# Within each id.a, keep the row(s) whose dist equals the group minimum
ddply(d, "id.a", function(df) df[df$dist == min(df$dist), ])
Which gives:
  id.a id.b dist
1    1    2   10
2    2    2   18
3    3    4    9
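Note that this filter keeps every row that ties for the minimum dist within an id.a group, whereas the duplicated() approach above keeps only the first such row. A minimal sketch with hypothetical tied data:

d2 <- data.frame(id.a = c(1, 1), id.b = c(1, 2), dist = c(5, 5))
ddply(d2, "id.a", function(df) df[df$dist == min(df$dist), ])  # returns both tied rows
d2[!duplicated(d2$id.a), ]                                     # keeps only the first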
Source: https://stackoverflow.com/questions/10835284/r-conditionally-remove-duplicate-rows