removing duplicate units from data frame

I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) covariates are irrelevant. I want to concatenate the values on the x covariates to uniquely identify the observations and eliminate the duplicates.

set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
      "1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1/1/2013",
      "1/1/2012")
OUT1 <- c(300,400,400,400,600,700,700,800,800,800,900,700,700,100,100,100,500)
JUNK1 <- c(rnorm(17,0,1))
JUNK2 <- c(rnorm(17,0,1))

test = data.frame(UNIT,DATE,OUT1,JUNK1,JUNK2)

'test' is a sample data frame. The variables I need to use to uniquely identify the observations are 'UNIT', 'DATE' and 'OUT1'. For example,

head(test)
  UNIT     DATE OUT1      JUNK1      JUNK2
1    1 1/1/2010  300 -1.2070657 -0.9111954
2    1 1/1/2010  400  0.2774292 -0.8371717
3    1 1/1/2010  400  1.0844412  2.4158352
4    1 1/2/2012  400 -2.3456977  0.1340882
5    2 1/2/2009  600  0.4291247 -0.4906859
6    2 1/2/2004  700  0.5060559 -0.4405479

Observations 1 and 4 are not a duplicate in the dataset. Observations 2 and 3 are duplicates. The new dataset I want to create would keep observations 1 and 4 and only one of 2 and 3. The solution I have tried is:

subset(test, !duplicated(c(UNIT,DATE,OUT1)))

Which unfortunately does not do the trick:

      UNIT     DATE OUT1       JUNK1      JUNK2
1        1 1/1/2010  300 -1.20706575 -0.9111954
5        2 1/2/2009  600  0.42912469 -0.4906859
8        3 1/2/2005  800 -0.54663186 -0.6937202
11       4 1/1/2011  900 -0.47719270 -1.0236557
14       5 1/1/2008  100  0.06445882  1.1022975
15       6 1/1/2012  100  0.95949406 -0.4755931

Although it does ignore the irrelevant variables (JUNK1, JUNK2) , the technique is too greedy. The new dataset should contain three observations on unit one because there are three unique combinations of UNIT + DATE + OUT1 when UNIT = 1. Is there a way to achieve this without writing a function?

mnel

You can pass a data.frame to duplicated

In your case, you want to pass the first 3 columns of test

 test2 <- test[!duplicated(test[,1:3]),]

If you are using big data, and want to embrace data.tables, then you can set the key to be the first three columns (which you want to remove the duplicates from) and then use unique

library(data.table)
DT <- data.table(test)
# set the key
setkey(DT, UNIT,DATE,OUT1)
DTU <- unique(DT)

For more details on duplicates and data.tables see Filtering out duplicated/non-unique rows in data.table

Thanks! Looks like we can do:

test2 <- test[!duplicated(test[,c("OUT1","DATE","UNIT")]),]

and it delivers the goods as well. So, we can just use the column names rather than 1:3 and the order doesn't matter

You can use distinct() from the dplyr package:

library(dplyr)
test %>%
  distinct(UNIT, DATE, OUT1)

Or without the %>% pipe:

distinct(test, UNIT, DATE, OUT1)

来源：https://stackoverflow.com/questions/15489185/removing-duplicate-units-from-data-frame

标签

duplicates

bigdata

duplicate-removal