I have been struggling with how to select ONLY the duplicated rows of a data.frame in R. For instance, my data.frame is:
age=18:29
height=c(76.1,77,78.1,78.2,78.8
A solution using duplicated twice:
village[duplicated(village$Names) | duplicated(village$Names, fromLast = TRUE), ]
Names age height
1 John 18 76.1
2 John 19 77.0
3 John 20 78.1
5 Paul 22 78.8
6 Paul 23 79.7
7 Paul 24 79.9
8 Khan 25 81.1
9 Khan 26 81.2
10 Khan 27 81.8
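To see why two calls are needed, here is a minimal sketch on a toy vector (the vector x is made up for illustration and is not from the question). duplicated() on its own marks only the second and later occurrences of a value, while fromLast = TRUE marks every occurrence except the last, so combining them with | marks every member of a repeated group:

```r
# Toy vector, invented for illustration only
x <- c("a", "a", "b", "c", "c", "c")

# Flags every occurrence after the first one
fwd <- duplicated(x)

# Scans from the end, so it flags every occurrence before the last one
bwd <- duplicated(x, fromLast = TRUE)

# A value belongs to a repeated group if either scan flags it
keep <- fwd | bwd
x[keep]
```

The single "b" is dropped because neither scan flags it.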
An alternative solution with by:
village[unlist(by(seq(nrow(village)), village$Names,
function(x) if(length(x)-1) x)), ]
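The if(length(x)-1) test is terse: a group of size 1 gives 0, which R treats as FALSE, so the function returns NULL and unlist drops it. A more explicit sketch of the same idea, using split and Filter instead of by, might look like this (the data frame df and its values are invented stand-ins, not the question's data):

```r
# Hypothetical stand-in for the question's data (names and ages invented)
df <- data.frame(Names = c("John", "John", "David", "Paul", "Paul"),
                 age = 18:22)

# Row indices grouped by name
groups <- split(seq_len(nrow(df)), df$Names)

# Keep only groups with more than one row; this spells out
# if (length(x) - 1) x, where a length of 1 yields 0, i.e. FALSE
repeated <- Filter(function(idx) length(idx) > 1, groups)

# Flatten to a plain index vector and restore the original row order
rows <- sort(unlist(repeated, use.names = FALSE))
df[rows, ]
```

Sorting the indices keeps the rows in their original order, since split arranges the groups by name rather than by first appearance.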
I came up with a solution using nested sapply:
> village_dups =
village[unique(unlist(which(sapply(sapply(village$Names,function(x)
which(village$Names==x)),function(y) length(y)) > 1))),]
> village_dups
Names age height
1 John 18 76.1
2 John 19 77.0
3 John 20 78.1
5 Paul 22 78.8
6 Paul 23 79.7
7 Paul 24 79.9
8 Khan 25 81.1
9 Khan 26 81.2
10 Khan 27 81.8
village[ duplicated(village),]
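One caveat worth checking: duplicated() applied to the whole data frame compares complete rows and skips the first occurrence of each, so when only the names repeat it may select nothing. A small sketch with made-up data (df is hypothetical):

```r
# Hypothetical data: names repeat, but no two rows match in every column
df <- data.frame(Names = c("John", "John", "Paul"), age = c(18, 19, 22))

# Compares complete rows, so nothing is flagged here
whole_row <- df[duplicated(df), ]
nrow(whole_row)

# Compares the Names column only, and still skips the first occurrence
by_name <- df[duplicated(df$Names), ]
nrow(by_name)
```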
I find @Sven's answer using duplicated the "tidiest", but you can also do this in many other ways. Here are two more:
Use table() and subset by matching the names where the tabulation is > 1 with the names present in the first column:
village[village$Names %in% names(which(table(village$Names) > 1)), ]
Use ave() to "tabulate" in a slightly different manner, but subset in the same way:
village[with(village, ave(as.numeric(Names), Names, FUN = length) > 1), ]
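As a rough side-by-side of the two counting approaches, here is a sketch on invented data (df is hypothetical). It assumes Names is a character column, which is the data.frame default since R 4.0.0; in that case as.numeric(Names) would produce NAs with a warning, so this variant passes seq_along() to ave() instead:

```r
# Hypothetical data, invented for illustration; Names is a character column
df <- data.frame(Names = c("John", "John", "David", "Paul", "Paul"),
                 age = 18:22, stringsAsFactors = FALSE)

# table() gives one count per distinct name
counts <- table(df$Names)
repeated_names <- names(counts[counts > 1])

# ave() returns the group size for every row; seq_along() avoids
# coercing character names with as.numeric()
group_size <- ave(seq_along(df$Names), df$Names, FUN = length)

via_table <- df[df$Names %in% repeated_names, ]
via_ave <- df[group_size > 1, ]
```

Both subsets should contain the same rows; the difference is only whether the counts are stored once per name (table) or once per row (ave).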