R - delete consecutive (ONLY) duplicates

长情又很酷 2020-12-11 06:55

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive. For example, for the following data fram

4 Answers
  • 2020-12-11 07:03

    Here is a data.table solution. The trick is to build a copy of x shifted by one row with the shift function and compare it with x.

    library(data.table)
    dattab <- as.data.table(df)
    # Keep rows whose x differs from the next row's x; fill pads the slot after the last row.
    dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")]
    

    This compares each value of x with the value immediately after it and drops the rows where the two match, so only the last row of each run survives. Set fill to a value that does not occur in x; otherwise the last row may be dropped by mistake.
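
    If choosing a sentinel for fill feels fragile, the same filter can be sketched with data.table's rleid(), which numbers each run of consecutive equal values. The sample df below is reconstructed from the outputs shown in the other answers, so treat it as an assumption about the asker's data.

    ```r
    library(data.table)

    # Sample data reconstructed from the outputs in this thread (an assumption).
    df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
                     y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
                     z = 1:9)
    dattab <- as.data.table(df)

    # rleid(x) assigns one id per run; duplicated(..., fromLast = TRUE) flags every
    # row of a run except its last, so negating it keeps the last row of each run.
    res_rleid <- dattab[!duplicated(rleid(x), fromLast = TRUE)]

    # The sentinel-based filter from this answer, for comparison.
    res_shift <- dattab[x != shift(x, n = 1, fill = -999, type = "lead")]
    ```

    Both results should contain the rows with z equal to 3, 5, 6, 8 and 9. The rleid() form needs no value that is guaranteed to be absent from x.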

  • 2020-12-11 07:14

    You just need to check that no duplicate follows a value, i.e. x[i+1] != x[i]. The last row is always kept, hence the trailing TRUE.

    df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
      x  y z
    3 1 30 3
    5 2 49 5
    6 4 13 6
    8 2 49 8
    9 1 30 9
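
    To see why this works, it can help to build the comparison vector on its own. The sketch below reconstructs df from the outputs shown in this thread (an assumption about the original data). The keep-first variant is not part of the answer and is included only for contrast.

    ```r
    # Sample data reconstructed from the outputs in this thread (an assumption).
    df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
                     y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
                     z = 1:9)
    n <- nrow(df)

    # TRUE where a row's x differs from the next row's x (length n - 1).
    differs_from_next <- df$x[-1] != df$x[-n]

    # Appending TRUE always keeps the final row, i.e. the end of the last run.
    keep_last <- c(differs_from_next, TRUE)
    last_of_runs <- df[keep_last, ]

    # Variant for contrast: prepending TRUE keeps the first row of each run instead.
    keep_first <- c(TRUE, differs_from_next)
    first_of_runs <- df[keep_first, ]
    ```

    Here last_of_runs holds rows 3, 5, 6, 8 and 9, while first_of_runs holds rows 1, 4, 6, 7 and 9.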
    
  • 2020-12-11 07:14

    How about:

    df[cumsum(rle(df$x)$lengths),]
    

    Explanation:

    rle(df$x)
    

    gives you the run lengths and values of consecutive duplicates in the x variable. Then:

    rle(df$x)$lengths
    

    extracts the lengths. Finally:

    cumsum(rle(df$x)$lengths)
    

    gives the index of the last row of each run, which you can then select using [.
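
    Applied to sample data reconstructed from the outputs in this thread (an assumption about the original df), the intermediate values look like this:

    ```r
    # Sample data reconstructed from the outputs in this thread (an assumption).
    df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
                     y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
                     z = 1:9)

    runs <- rle(df$x)
    runs$values                   # 1 2 4 2 1 : the value of each run
    runs$lengths                  # 3 2 1 2 1 : how many rows each run spans

    idx <- cumsum(runs$lengths)   # 3 5 6 8 9 : the row where each run ends
    result <- df[idx, ]
    ```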

    EDIT: for fun, here's a microbenchmark of the answers given so far. rle is mine; consec is the answer given by @James, which I think is the most fundamentally direct one and the one I would "accept"; dp is the dplyr answer given by @Nik.

    #> Unit: microseconds
    #>    expr       min         lq       mean     median         uq        max
    #>     rle   134.389   145.4220   162.6967   154.4180   172.8370    375.109
    #>  consec   111.411   118.9235   136.1893   123.6285   145.5765    314.249
    #>      dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
    

    rle performs better than I thought it would.
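
    The benchmark code itself was not posted. A minimal sketch of how such a comparison could be set up with the microbenchmark package follows. The sample df, the wrapper function names and times = 100 are all assumptions, and timings will differ from machine to machine.

    ```r
    library(dplyr)
    library(microbenchmark)

    # Sample data reconstructed from the outputs in this thread (an assumption).
    df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
                     y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
                     z = 1:9)

    # Hypothetical wrappers around the three answers being compared.
    rle_fun    <- function(d) d[cumsum(rle(d$x)$lengths), ]
    consec_fun <- function(d) d[c(d$x[-1] != d$x[-nrow(d)], TRUE), ]
    dp_fun <- function(d) {
      d %>%
        mutate(id = lag(x, 1),
               decision = if_else(x != id, 1, 0),
               final = lead(decision, 1, default = 1)) %>%
        filter(final == 1) %>%
        select(-id, -decision, -final)
    }

    # Run each expression 100 times and summarise the timings.
    bench <- microbenchmark(rle = rle_fun(df),
                            consec = consec_fun(df),
                            dp = dp_fun(df),
                            times = 100)
    print(bench)
    ```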

  • 2020-12-11 07:17

    A cheap solution with dplyr that I could think of:

    Method:

    library(dplyr)
    df %>% 
      mutate(id = lag(x, 1), 
             decision = if_else(x != id, 1, 0), 
             final = lead(decision, 1, default = 1)) %>% 
      filter(final == 1) %>% 
      select(-id, -decision, -final)
    

    Output:

      x  y z
    1 1 30 3
    2 2 49 5
    3 4 13 6
    4 2 49 8
    5 1 30 9
    

    This even works if your data ends with a run of the same x value.

    New Input:

    df2 <- df %>% add_row(x = 1, y = 10, z = 12)
    df2
    
       x  y  z
    1  1 10  1
    2  1 11  2
    3  1 30  3
    4  2 12  4
    5  2 49  5
    6  4 13  6
    7  2 12  7
    8  2 49  8
    9  1 30  9
    10 1 10 12
    

    Use same method:

    df2 %>% 
      mutate(id = lag(x, 1), 
             decision = if_else(x != id, 1, 0), 
             final = lead(decision, 1, default = 1)) %>% 
      filter(final == 1) %>% 
      select(-id, -decision, -final)
    

    New Output:

      x  y  z
    1 1 30  3
    2 2 49  5
    3 4 13  6
    4 2 49  8
    5 1 10 12
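
    For comparison, the same filter can be sketched more compactly by comparing x with lead(x) directly. This is an alternative formulation, not the method used above, and it assumes x contains no NA values. The data frame is rebuilt from the New Input printed above.

    ```r
    library(dplyr)

    # df2 rebuilt from the "New Input" listing (values copied from that output).
    df2 <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1, 1),
                      y = c(10, 11, 30, 12, 49, 13, 12, 49, 30, 10),
                      z = c(1:9, 12))

    # Keep a row when its x differs from the next row's x. lead(x) is NA on the
    # last row, so that row is kept explicitly via row_number() == n().
    result <- df2 %>%
      filter(x != lead(x) | row_number() == n())
    ```

    On df2 this should keep the rows with z equal to 3, 5, 6, 8 and 12, matching the New Output above.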
    