I have a data frame in R that is supposed to have duplicates. However, there are some duplicates that I would need to remove. In particular, I only want to
Here's an rle solution:
df[cumsum(rle(as.character(df$x))$lengths), ]
# x y
# 1 A 1
# 2 B 2
# 3 C 3
# 4 A 4
# 5 B 5
# 6 C 6
# 7 A 7
# 9 B 9
# 10 C 10
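The original data frame is not shown, so here is a minimal reproducible sketch. The contents of df below are an assumption, chosen only to be consistent with the printed output (row 8 is a consecutive repeat of B and gets dropped):

```r
# Hypothetical data: the real df is not given in the question.
df <- data.frame(
  x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"),
  y = 1:10,
  stringsAsFactors = FALSE
)

# Index of the last row of each run of identical consecutive x values.
last_of_run <- cumsum(rle(as.character(df$x))$lengths)

# Keep only those rows; non-consecutive repeats (A, B, C cycling) survive.
result <- df[last_of_run, ]
print(result)
```

Because the cumulative sum points at the end of each run, this keeps the last row of every run; the first row of each run would need a different index.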
Explanation:
RLE stands for run-length encoding. rle() returns a list with two vectors: values, the value of each run, and lengths, the number of consecutive repeats of that value. For example, x <- c(3, 2, 2, 3) has values c(3, 2, 3) and lengths c(1, 2, 1). The cumulative sum of the lengths is c(1, 3, 4), which is the index of the last element of each run. Subset x with this vector and you get c(3, 2, 3). Note that the second element of the cumulative sum is 3, the index of the third element of x, which is the last occurrence of 2 in that particular run.
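The small vector example can be checked step by step in the console. This is just a sketch of the walk-through above; the variable names r and ends are introduced here for illustration:

```r
x <- c(3, 2, 2, 3)

r <- rle(x)
r$values   # value of each run: 3 2 3
r$lengths  # length of each run: 1 2 1

# The cumulative sum of the run lengths is the position where each run ends.
ends <- cumsum(r$lengths)
ends       # 1 3 4

# Subsetting by those positions keeps one element per run.
x[ends]    # 3 2 3
```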