I have a data frame from which I want to remove consecutive duplicates (in base R). I know rle
may be helpful here, but I can't think of how to use it. The example output should illustrate what I'm asking for.
Generate sample data:
set.seed(12)
samps <- sample(1:5, 20, TRUE)
dat <- data.frame(v1=LETTERS[samps], v2=month.abb[samps])
dat[10, 2] <- "Mar"
Sample data:
v1 v2
1 A Jan
2 E May
3 E May
4 B Feb
5 A Jan
6 A Jan
7 A Jan
8 D Apr
9 A Jan
10 A Mar
11 B Feb
12 E May
13 B Feb
14 B Feb
15 B Feb
16 C Mar
17 C Mar
18 C Mar
19 D Apr
20 A Jan
Desired outcome:
v1 v2
1 A Jan
3 E May
4 B Feb
7 A Jan
8 D Apr
10 A Mar
11 B Feb
12 E May
15 B Feb
18 C Mar
19 D Apr
20 A Jan
Here's a way, not with rle, but a way nonetheless:
dat[with(dat, c(TRUE, diff(as.numeric(interaction(v1, v2))) != 0)), ]
This assumes you're using factor columns, as your sample data implies.
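To make the mechanics of that one-liner visible, here is a minimal sketch on a small made-up data frame (`toy`, `codes`, `keep_first` and `keep_last` are illustrative names, not part of the answer). It also shows that this approach keeps the first row of each run; shifting the `TRUE` to the end keeps the last row instead, which is what the desired outcome in the question shows.

```r
# Hypothetical toy data, not the question's data
toy <- data.frame(v1 = c("A", "A", "B", "B", "A"),
                  v2 = c("x", "x", "y", "z", "x"),
                  stringsAsFactors = TRUE)

# One numeric code per (v1, v2) combination
codes <- as.numeric(interaction(toy$v1, toy$v2))

# TRUE where a row differs from the previous one; the first row always starts a run
keep_first <- c(TRUE, diff(codes) != 0)

# TRUE where a row differs from the next one; the last row always ends a run
keep_last <- c(diff(codes) != 0, TRUE)

toy[keep_first, ]  # first row of each run
toy[keep_last, ]   # last row of each run
```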
Here is a fast solution using filter (stats::filter, which other packages such as dplyr can mask):
dat[(stats::filter(dat, c(-1, 1)) != 0)[, 1], ]
v1 v2
1 A Jan
3 E May
4 B Feb
7 A Jan
8 D Apr
10 A Mar
11 B Feb
12 E May
15 B Feb
18 C Mar
19 D Apr
NA <NA> <NA>
The last row comes back as NA because the filter has no following value to compare it with, so you need to add the last row of the original data to the result.
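One possible way to handle that trailing NA, sketched on a made-up vector of codes rather than the question's data (`codes`, `f` and `keep` are illustrative names): treat the NA as TRUE, since the last row always ends a run.

```r
# Hypothetical numeric codes for one column: A A B B B C
codes <- as.numeric(factor(c("A", "A", "B", "B", "B", "C")))

# With the two-sided convolution default, f[i] = codes[i] - codes[i + 1],
# so the final element has no successor and is NA
f <- stats::filter(codes, c(-1, 1))

keep <- as.vector(f != 0)
keep[is.na(keep)] <- TRUE  # the last element always closes a run

which(keep)  # positions of the last element of each run
```

Note that the answer's `[, 1]` only looks at the first column, so this variant, like the original, detects runs in one column only.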
Using rle, I came up with this:
ind <- cumsum(rle(as.character(dat$v1))$lengths)
dat[ind, ]
ind holds the row index of the last entry in each run of consecutive identical v1 values.
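As a small illustration of what rle returns and why the cumulative sum lands on the end of each run, here is a sketch using the first seven v1 values from the sample data (the names `x`, `r`, `last` and `first` are only for illustration). The `first` line is an optional variant for keeping the start of each run instead.

```r
# First seven v1 values from the sample data in the question
x <- c("A", "E", "E", "B", "A", "A", "A")

r <- rle(x)
r$values   # the value of each run
r$lengths  # how many times each value repeats consecutively

# Cumulative run lengths give the index of the LAST element of each run
last <- cumsum(r$lengths)

# Subtracting the run length and adding 1 gives the FIRST element of each run
first <- last - r$lengths + 1
```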
EDIT:
A simple solution to Matthew's comment would be to build the runs from both columns pasted together:
dat[15, 2] <- "May"
dat[cumsum(rle(paste0(dat$v1, dat$v2))$lengths), ]
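One caveat not raised in the answer: pasting columns together without a separator can make different rows look identical. This does not happen with the sample data, but the sketch below uses a contrived data frame (`toy` and `key` are hypothetical names) to show the collision. It then shows one possible workaround, which assumes the chosen separator never occurs in the data.

```r
# Contrived data where different rows paste to the same string
toy <- data.frame(v1 = c("ab", "a", "a"),
                  v2 = c("c", "bc", "bc"))

# Without a separator every row becomes "abc", so rle sees a single run
collapsed <- rle(paste0(toy$v1, toy$v2))$lengths

# Assumption: "\r" never appears in the data, so the keys stay distinct
key <- paste(toy$v1, toy$v2, sep = "\r")
ind <- cumsum(rle(key)$lengths)

toy[ind, ]  # last row of each run, using both columns
```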
Source: https://stackoverflow.com/questions/14056153/remove-consecutive-duplicates-from-dataframe