Remove duplicates and collapse near duplicates based on time difference

Question

I have a data frame like the one shown below:

DF = structure(list(
  Age_visit = c(48, 48, 48, 49, 49, 77),
  Date_1 = c("8/6/2169 9:40", "8/6/2169 9:40", "8/6/2169 9:41",
             "8/6/2169 9:42", "24/7/2169 8:31", "12/9/2169 10:30"),
  Date_2 = c("NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA",
             "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA"),
  person_id = c("21", "21", "21", "21", "21", "31"),
  enc_id = c("A21BC", "A21BC", "A22BC", "A23BC", "A24BC", "A31BC")
), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

dataframe

  Age_visit Date_1          Date_2            person_id enc_id
      <dbl> <chr>           <chr>             <chr>     <chr> 
1        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA  21        A21BC 
2        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA  21        A21BC 
3        48 8/6/2169 9:41   NA-NA-NA NA:NA:NA  21        A22BC 
4        49 8/6/2169 9:42   NA-NA-NA NA:NA:NA  21        A23BC 
5        49 24/7/2169 8:31  NA-NA-NA NA:NA:NA  21        A24BC 
6        77 12/9/2169 10:30 NA-NA-NA NA:NA:NA  31        A31BC

I have two rules/steps to be implemented.

Rule-1 (step-1)

First, remove duplicates based on three columns: Date_1, person_id, and enc_id.

DF[!duplicated(DF[,c('Date_1','person_id','enc_id')]),]  # this removes the 2nd row, which is an exact duplicate of the 1st
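
As a quick check of which row duplicated() flags, here is a minimal sketch on the first three key rows from the table above. The names keys, dup and kept are illustrative and not part of the original code.

```r
# First three key rows from the question; rows 1 and 2 are identical
keys <- data.frame(
  Date_1    = c("8/6/2169 9:40", "8/6/2169 9:40", "8/6/2169 9:41"),
  person_id = c("21", "21", "21"),
  enc_id    = c("A21BC", "A21BC", "A22BC")
)

# duplicated() marks every repeat of an earlier row, never the first occurrence
dup <- duplicated(keys)

# Negating the flags keeps rows 1 and 3 and drops row 2
kept <- keys[!dup, ]
```

So the first occurrence survives and the later copy is the one removed.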

Rule-2 (step-2)

From the output of step-1, collapse near-duplicate records (notice the tiny differences in the Date_1 and enc_id columns) into one single record if the time difference between them is less than an hour.

For example, for person_id = 21, after step-1 his first three Date_1 values fall on the same day and are only one minute apart (9:40 --> 9:41 --> 9:42). Since the difference is less than an hour (60 mins), we collapse them into one single record and retain only the first one (which is for 9:40). This check is done for each subject in the data frame.

I have removed the duplicates based on a few columns, as shown below:

DF[!duplicated(DF[,c('Date_1','person_id','enc_id')]),]

I expect my output to look like this:

  Age_visit Date_1          Date_2            person_id enc_id
      <dbl> <chr>           <chr>             <chr>     <chr> 
1        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA  21        A21BC 
4        49 24/7/2169 8:31  NA-NA-NA NA:NA:NA  21        A24BC 
5        77 12/9/2169 10:30 NA-NA-NA NA:NA:NA  31        A31BC

Answer 1:

A rolling join option using data.table:

DT[, c("rn", "hrago") := .(.I, Date_1 - 60 * 60)]
DT[DT[DT, on=.(person_id, Date_1=hrago), roll=-Inf, unique(rn)]]

output:

   Age_visit              Date_1 person_id enc_id rn               hrago
1:        48 2169-06-08 09:40:00        21  A21BC  1 2169-06-08 08:40:00
2:        49 2169-07-24 08:31:00        21  A24BC  5 2169-07-24 07:31:00
3:        77 2169-09-12 10:30:00        31  A31BC  6 2169-09-12 09:30:00

data:

library(data.table)
DT <- fread("Age_visit Date_1    person_id enc_id
48 8/6/2169-9:40    21        A21BC 
48 8/6/2169-9:40    21        A21BC 
48 8/6/2169-9:41    21        A22BC 
49 8/6/2169-9:42    21        A23BC 
49 24/7/2169-8:31   21        A24BC 
77 12/9/2169-10:30  31        A31BC") 
DT[, Date_1 := as.POSIXct(Date_1, format="%d/%m/%Y-%H:%M")]

Explanation:

1) DT[DT, on=.(person_id, Date_1=hrago), ...] is a self-join that matches person_id in both tables, and matches Date_1 from the right table (the outer DT) against hrago from the left table (the inner DT).

2) roll=-Inf rolls the next observation in the right table backwards when no identical match for the observation in the left table is found.

3) unique(rn) takes the unique row numbers matched in the right table; the outer DT[...] then filters the table to these rows.
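
The effect of roll=-Inf can be seen on a toy table. The sketch below is only an illustration: x, probe, id and ts_min are hypothetical names, and times are plain numbers (minutes) rather than POSIXct values, to keep the example small.

```r
library(data.table)

# Lookup table: three timestamps (in minutes) for one person, with row numbers
x <- data.table(id = 1L, ts_min = c(580, 581, 5000), rn = 1:3)

# Probe table: each timestamp minus 60 minutes (the role played by hrago)
probe <- data.table(id = 1L, ts_min = c(520, 521, 4940))

# No probe value matches exactly, so roll = -Inf carries the NEXT observation
# in x backward: 520 and 521 both land on 580 (rn 1), 4940 lands on 5000 (rn 3)
matched <- x[probe, on = .(id, ts_min), roll = -Inf, rn]

# Row numbers that start a new "more than an hour apart" block
first_rows <- unique(matched)
```

Rows 1 and 2 of the toy table map to the same row number, which is why unique() leaves one record per cluster.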

Answer 2:

Your question can be solved using a dplyr pipeline.

The first step solves the duplicate problem using distinct().
The second step converts the Date_1 column to a datetime type (necessary for calculating time differences).
The third step adds a column with the previous timestamp using lag(). This must be done inside a group_by() on person_id to make sure that timestamps are not shifted onto other people. It is also important to make sure the dates are sorted properly (using arrange()).
The fourth step calculates the time difference since the previous timestamp, in seconds. This gives NA for the first row of each person.
The fifth step removes all records with a time difference of less than one hour.
The last step removes the additional columns that were created in the pipeline.

library(dplyr)

DF %>% 
  distinct(Date_1, person_id, enc_id, .keep_all = TRUE) %>% 
  mutate(Date_1 = as.POSIXct(Date_1, format = '%d/%m/%Y %H:%M')) %>% 
  group_by(person_id) %>% 
  arrange(Date_1) %>%
  mutate(Date_lag = lag(Date_1)) %>% 
  ungroup() %>% 
  mutate(Date_diff = difftime(Date_1, Date_lag, units = 'secs')) %>% 
  filter(is.na(Date_diff) | Date_diff >= 3600) %>% 
  select(Age_visit, Date_1, Date_2, person_id, enc_id)
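
To see which rows the filter() step keeps, the gaps can be worked out by hand for person 21. This is a base R sketch using the timestamps from the question (after exact duplicates are removed); stamps, gap and keep are illustrative names, and tz = "UTC" is an assumption made only so the result does not depend on the local time zone.

```r
# Date_1 values of person 21 once the exact duplicate is gone
stamps <- as.POSIXct(
  c("8/6/2169 9:40", "8/6/2169 9:41", "8/6/2169 9:42", "24/7/2169 8:31"),
  format = "%d/%m/%Y %H:%M", tz = "UTC"
)

# Seconds since the previous record; NA for the first record of the person,
# which mirrors what lag() followed by difftime(units = 'secs') produces
gap <- c(NA, diff(as.numeric(stamps)))

# Same condition as the filter(): keep the first record and any record
# that is at least one hour after the previous one
keep <- is.na(gap) | gap >= 3600
```

Only the 9:40 record and the 24/7 record pass the condition. Note that each record is compared with the one immediately before it, not with the first record that was kept.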

Answer 3:

You can do both in the same step by checking successive time differences. Duplicates have a time difference of 0:

library(dplyr)
library(lubridate)

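DF %>%
  mutate(Date_1 = dmy_hm(Date_1)) %>%
  group_by(person_id) %>%
  arrange(Date_1, .by_group = TRUE) %>%
  # Inf keeps the first record of each person; the units are forced to seconds
  # because diff() on date-times otherwise picks its units automatically
  filter(c(Inf, as.numeric(diff(Date_1), units = "secs")) >= 3600)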


  Age_visit Date_1              Date_2            person_id enc_id
      <dbl> <dttm>              <chr>             <chr>     <chr> 
1        48 2169-06-08 09:40:00 NA-NA-NA NA:NA:NA 21        A21BC 
2        49 2169-07-24 08:31:00 NA-NA-NA NA:NA:NA 21        A24BC 
3        77 2169-09-12 10:30:00 NA-NA-NA NA:NA:NA 31        A25BC
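
A comparison against 3600 only makes sense if the differences are in seconds, and diff() on date-times chooses its units from the data. The sketch below shows the pitfall on two made-up timestamps three hours apart; a, d and naive are illustrative names, and tz = "UTC" is an assumption to keep the example independent of the local time zone.

```r
# Two made-up timestamps, exactly three hours apart
a <- as.POSIXct(c("2169-06-08 09:40", "2169-06-08 12:40"), tz = "UTC")

# diff() returns a difftime whose units are chosen automatically
d <- diff(a)
auto_units <- units(d)

# Combining with a plain number drops the units, so the gap becomes 3, not 10800
naive <- c(5000, d) > 3600

# Forcing seconds gives a value that can safely be compared with 3600
secs <- as.numeric(d, units = "secs")
```

With the question's data the smallest gap is under a minute, so the automatic units happen to be seconds; on data without such small gaps the unqualified comparison could silently drop rows that are hours apart.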

There was a mistake in your data as posted (person_id 31 was missing). Here is the version I used:

DF = structure(list(
  Age_visit = c(48, 48, 48, 49, 49, 77),
  Date_1 = c("8/6/2169 9:40", "8/6/2169 9:40", "8/6/2169 9:41",
             "8/6/2169 9:42", "24/7/2169 8:31", "12/9/2169 10:30"),
  Date_2 = c("NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA",
             "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA"),
  person_id = c("21", "21", "21", "21", "21", "31"),
  enc_id = c("A21BC", "A21BC", "A22BC", "A23BC", "A24BC", "A25BC")
), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

Source: https://stackoverflow.com/questions/61269745/remove-duplicates-and-collapse-near-duplicates-based-on-time-difference

Tags

dataframe

dplyr

data.table

tidyr