R - delete consecutive (ONLY) duplicates


I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive. For example, for the following data fram

4 Answers

鱼传尺愫

2020-12-11 07:03
Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x.
```
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")]  # keep rows whose next x differs
```
This way you compare each value of x with the value immediately following it and drop the rows where they match. Set fill to a value that does not occur in x so that the last row is always kept.
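If picking a sentinel for fill feels fragile, one alternative is to avoid the sentinel altogether. The sketch below labels each run with data.table's rleid() and keeps the last row of every run. The df here is reconstructed from rows printed elsewhere in this thread, so treat it as an assumption about the original input.

```r
library(data.table)

# Assumed input, reconstructed from the rows printed in this thread.
dattab <- data.table(
  x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
  y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
  z = 1:9
)

# rleid(x) gives each run of equal consecutive values its own id.
# duplicated(..., fromLast = TRUE) flags every row of a run except the last.
res <- dattab[!duplicated(rleid(x), fromLast = TRUE)]
res
```

Because no fill value is involved, this variant does not depend on knowing which values can appear in x.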
礼貌的吻别

2020-12-11 07:14
You just need to check that no duplicate follows a number, i.e. x[i+1] != x[i], and note that the last value will always be present.
```
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
  x  y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
```
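The same comparison can be wrapped in a small function, which also makes it easy to keep the first row of each run instead of the last. This is only a sketch: drop_consecutive is a hypothetical helper name, the df is reconstructed from rows printed in this thread, and the column is assumed to contain no NA values.

```r
# Assumed input, reconstructed from the rows printed in this thread.
df <- data.frame(
  x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
  y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
  z = 1:9
)

# Hypothetical helper: keep one row per run of equal values in `col`.
drop_consecutive <- function(data, col, keep = c("last", "first")) {
  keep <- match.arg(keep)
  n <- nrow(data)
  if (n == 0L) return(data)   # nothing to compare in an empty data frame
  v <- data[[col]]
  changed <- v[-1] != v[-n]   # TRUE where a value differs from its successor
  idx <- if (keep == "last") c(changed, TRUE) else c(TRUE, changed)
  data[idx, , drop = FALSE]
}

last_rows  <- drop_consecutive(df, "x")                  # last row of each run
first_rows <- drop_consecutive(df, "x", keep = "first")  # first row of each run
```

The early return covers the zero-row case, where the one-liner would otherwise index an empty data frame with TRUE.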
萌比男神i

2020-12-11 07:14
How about:
```
df[cumsum(rle(df$x)$lengths),]
```
Explanation:
```
rle(df$x)
```
gives you the run lengths and values of consecutive duplicates in the x variable. Then:
```
rle(df$x)$lengths
```
extracts the lengths. Finally:
```
cumsum(rle(df$x)$lengths)
```
gives the row index of the last element of each run, which you can select using [.
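To make the intermediate steps concrete, here is a small sketch on a plain vector. The values of x are taken from the rows printed elsewhere in this thread, so they are an assumption about the original input.

```r
# Assumed x column, reconstructed from the rows printed in this thread.
x <- c(1, 1, 1, 2, 2, 4, 2, 2, 1)

runs <- rle(x)
runs$lengths   # 3 2 1 2 1  (length of each run)
runs$values    # 1 2 4 2 1  (value repeated in each run)

# Position of the last element of each run.
last_idx <- cumsum(runs$lengths)          # 3 5 6 8 9

# Position of the first element of each run, if that is what you want to keep.
first_idx <- last_idx - runs$lengths + 1L # 1 4 6 7 9
```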

EDIT: for fun, here's a microbenchmark of the answers given so far. rle is mine; consec is the answer given by @James, which I think is the most fundamentally direct one and the one I would "accept"; dp is the dplyr answer given by @Nik.
```
#> Unit: microseconds
#>    expr       min         lq       mean     median         uq        max
#>     rle   134.389   145.4220   162.6967   154.4180   172.8370    375.109
#>  consec   111.411   118.9235   136.1893   123.6285   145.5765    314.249
#>      dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
```
rle performs better than I thought it would.
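The benchmark code itself was not posted. A sketch of how the two base R timings might be reproduced is below, assuming the microbenchmark package is installed and using a df reconstructed from rows printed in this thread. The function names rle_method and consec_method are made up for this example, and timings will differ by machine.

```r
# Assumed input, reconstructed from the rows printed in this thread.
df <- data.frame(
  x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
  y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
  z = 1:9
)

# Hypothetical wrappers around the two base R answers.
rle_method    <- function(d) d[cumsum(rle(d$x)$lengths), ]
consec_method <- function(d) d[c(d$x[-1] != d$x[-nrow(d)], TRUE), ]

# Check that both approaches select the same rows before timing them.
res_rle    <- rle_method(df)
res_consec <- consec_method(df)
stopifnot(identical(res_rle$z, res_consec$z))

# Only run the timing if the package is available.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    rle    = rle_method(df),
    consec = consec_method(df),
    times  = 100L
  ))
}
```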

南笙

2020-12-11 07:17

A cheap solution with dplyr that I could think of:

Method:

```
library(dplyr)
df %>% 
  mutate(id = lag(x, 1), 
         decision = if_else(x != id, 1, 0), 
         final = lead(decision, 1, default = 1)) %>% 
  filter(final == 1) %>% 
  select(-id, -decision, -final)
```

Output:

This also works when the last rows of your data share the same x value.

New Input:

```
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2

   x  y  z
1  1 10  1
2  1 11  2
3  1 30  3
4  2 12  4
5  2 49  5
6  4 13  6
7  2 12  7
8  2 49  8
9  1 30  9
10 1 10 12
```

Use same method:

```
df2 %>% 
  mutate(id = lag(x, 1), 
         decision = if_else(x != id, 1, 0), 
         final = lead(decision, 1, default = 1)) %>% 
  filter(final == 1) %>% 
  select(-id, -decision, -final)
```

New Output:

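As a side note, the helper columns in the dplyr approach can probably be avoided. The sketch below compares each x with lead(x) directly inside filter() and keeps the last row explicitly, since lead(x) is NA there. The df is reconstructed from the df2 printout in this answer, so it is an assumption, and x is assumed to contain no NA values.

```r
library(dplyr)

# Assumed input, reconstructed from the df2 printout (rows 1 to 9).
df <- data.frame(
  x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
  y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
  z = 1:9
)
# Same extra row as in df2, built with base rbind here.
df2 <- rbind(df, data.frame(x = 1, y = 10, z = 12))

# Keep a row if the next x differs, or if it is the last row.
# On the last row x != lead(x) is NA, and NA | TRUE evaluates to TRUE.
out  <- df  %>% filter(x != lead(x) | row_number() == n())
out2 <- df2 %>% filter(x != lead(x) | row_number() == n())
```

For df2, where the final two rows share x = 1, only the very last row of that run is kept.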