I have a data frame in R containing the columns ID.A, ID.B and DISTANCE, where DISTANCE represents the distance between ID.A and ID.B. For each value (1 to n) of ID.A there may be several rows, each with a different ID.B and DISTANCE (i.e. ID.A is duplicated across rows, e.g. several rows all with ID.A = 4, each carrying its own ID.B and distance).
I would like to remove the rows where ID.A is duplicated, conditional on the distance value, so that I am left with the smallest DISTANCE for each ID.A.
Hopefully that makes sense?
Many thanks in advance
EDIT
Hopefully an example will prove more useful than my text. Here I would like to remove the second and third rows where ID.A = 3, keeping the row with DISTANCE 0.4:
myDF <- read.table(text="ID.A ID.B DISTANCE
1 3 1
2 6 8
3 2 0.4
3 3 1
3 8 5
4 8 7
5 2 11", header = TRUE)
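In other words, the result should keep only the smallest-DISTANCE row for each value of ID.A:
ID.A ID.B DISTANCE
   1    3      1.0
   2    6      8.0
   3    2      0.4
   4    8      7.0
   5    2     11.0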
You can also do it easily in base R. If dat is your data frame:
# For each ID.A group, order by DISTANCE and keep the first (smallest-distance) row
do.call(rbind,
        by(dat, INDICES = list(dat$ID.A),
           FUN = function(x) head(x[order(x$DISTANCE), ], 1)))
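For example, applied to the question's myDF (simply substituting myDF for dat):

res <- do.call(rbind,
               by(myDF, INDICES = list(myDF$ID.A),
                  FUN = function(x) head(x[order(x$DISTANCE), ], 1)))
res  # one row per ID.A, keeping the smallest DISTANCE (0.4 for ID.A = 3)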
One possibility:
# Sort so that, within each ID.A, the smallest DISTANCE comes first,
# then keep only the first row of each ID.A
myDF <- myDF[order(myDF$ID.A, myDF$DISTANCE), ]
newdata <- myDF[!duplicated(myDF$ID.A), ]
Which gives:
  ID.A ID.B DISTANCE
1    1    3      1.0
2    2    6      8.0
3    3    2      0.4
6    4    8      7.0
7    5    2     11.0
You can use the plyr package for that. For example, if your data look like this:
d <- data.frame(id.a = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                id.b = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
                dist = c(12, 10, 15, 20, 18, 16, 17, 25, 9))
  id.a id.b dist
1    1    1   12
2    1    2   10
3    1    3   15
4    2    1   20
5    2    2   18
6    3    1   16
7    3    2   17
8    3    3   25
9    3    4    9
You can use the ddply function like this:
library(plyr)
# Within each id.a, keep the row(s) whose dist equals the group minimum
ddply(d, "id.a", function(df) df[df$dist == min(df$dist), ])
Which gives:
  id.a id.b dist
1    1    2   10
2    2    2   18
3    3    4    9
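Note that this filter keeps every row that ties for the minimum dist within an id.a group, whereas the duplicated() approach above keeps only the first such row. A minimal sketch with hypothetical tied data:

d2 <- data.frame(id.a = c(1, 1), id.b = c(1, 2), dist = c(5, 5))
ddply(d2, "id.a", function(df) df[df$dist == min(df$dist), ])  # returns both tied rows
d2[!duplicated(d2$id.a), ]                                     # keeps only the first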
Source: https://stackoverflow.com/questions/10835284/r-conditionally-remove-duplicate-rows