replace duplicate values with NA in time series data using dplyr

My data seems a bit different than other similar kind of posts.

box_num      date       x        y
1-Q      2018-11-18   20.2      8
1-Q      2018-11-25   21.23     7.2
1-Q      2018-12-2    21.23     23
98-L     2018-11-25   0.134     9.3
98-L     2018-12-2    0.134     4
76-GI    2018-12-2    22.734    4.562
76-GI    2018-12-9    28        4.562

Here I would like to replace the repeated values with NA in both x and y columns. The code I have tried using dplyr :

(1)df <- df %>% group_by(box_num) %>% arrange(box_num,date) %>%
  mutate(df$x[duplicated(df$x),] <- NA)

It creates a new column with all NA's instead of just replacing a repeated value with NA

 (2)df <- df %>% group_by(box_num) %>% arrange(box_num,date) %>%  
distinct(x,.keep_all = TRUE)

The second one just gives the rows that are not duplicated(we are missing the time series) Desired Output :

box_num      date       x        y
    1-Q      2018-11-18   20.2      8
    1-Q      2018-11-25   21.23     7.2
    1-Q      2018-12-2    NA        23
    98-L     2018-11-25   0.134     9.3
    98-L     2018-12-2    NA        4
    76-GI    2018-12-2    22.734    4.562
    76-GI    2018-12-9    28        NA

Using dplyr we can group_by box_num and use mutate_at x and y column and replace the duplicated value by NA.

library(dplyr)

df %>%
  group_by(box_num) %>%
  mutate_at(vars(x:y), funs(replace(., duplicated(.), NA)))


# box_num date          x     y
#  <fct>   <fct>      <dbl> <dbl>
#1 1-Q     2018-11-18 20.2    8   
#2 1-Q     2018-11-25 21.2    7.2 
#3 1-Q     2018-12-2  NA     23   
#4 98-L    2018-11-25  0.134  9.3 
#5 98-L    2018-12-2  NA      4   
#6 76-GI   2018-12-2  22.7    4.56
#7 76-GI   2018-12-9  28     NA

A base R option (which might not be the best in this case) would be :

cols <- c("x", "y")
df[cols] <- sapply(df[cols], function(x) 
            ave(x, df$box_num, FUN = function(x) replace(x, duplicated(x), NA)))

Here is an option with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1), specify the columns of interest in .SDcols, replace the duplicated elements in the columns with NA and update those columns by assigning (:=) the output back to the columns

library(data.table)
setDT(df1)[,  c('x', 'y') := lapply(.SD, function(x) 
     replace(x, anyDuplicated(x), NA)), box_num, .SDcols= x:y]
df1
#   box_num       date      x      y
#1:     1-Q 2018-11-18 20.200  8.000
#2:     1-Q 2018-11-25 21.230  7.200
#3:     1-Q  2018-12-2     NA 23.000
#4:    98-L 2018-11-25  0.134  9.300
#5:    98-L  2018-12-2     NA  4.000
#6:   76-GI  2018-12-2 22.734  4.562
#7:   76-GI  2018-12-9 28.000     NA

data

df1 <- structure(list(box_num = c("1-Q", "1-Q", "1-Q", "98-L", "98-L", 
 "76-GI", "76-GI"), date = c("2018-11-18", "2018-11-25", "2018-12-2", 
"2018-11-25", "2018-12-2", "2018-12-2", "2018-12-9"), x = c(20.2, 
 21.23, 20.2, 0.134, 0.134, 22.734, 28), y = c(8, 7.2, 23, 9.3, 
 4, 4.562, 4.562)), class = "data.frame", 
 row.names = c(NA, -7L))

来源：https://stackoverflow.com/questions/54067963/replace-duplicate-values-with-na-in-time-series-data-using-dplyr

标签

dplyr

duplicates

time-series