I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) covariates are irrelevant. I want to concatenate the values on the x covariates to uniquely identify the observations and eliminate the duplicates.
set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
"1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1/1/2013",
"1/1/2012")
OUT1 <- c(300,400,400,400,600,700,700,800,800,800,900,700,700,100,100,100,500)
JUNK1 <- c(rnorm(17,0,1))
JUNK2 <- c(rnorm(17,0,1))
test = data.frame(UNIT,DATE,OUT1,JUNK1,JUNK2)
'test' is a sample data frame. The variables I need to use to uniquely identify the observations are 'UNIT', 'DATE' and 'OUT1'. For example,
head(test)
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.2070657 -0.9111954
2 1 1/1/2010 400 0.2774292 -0.8371717
3 1 1/1/2010 400 1.0844412 2.4158352
4 1 1/2/2012 400 -2.3456977 0.1340882
5 2 1/2/2009 600 0.4291247 -0.4906859
6 2 1/2/2004 700 0.5060559 -0.4405479
Observations 1 and 4 are not a duplicate in the dataset. Observations 2 and 3 are duplicates. The new dataset I want to create would keep observations 1 and 4 and only one of 2 and 3. The solution I have tried is:
subset(test, !duplicated(c(UNIT,DATE,OUT1)))
Which unfortunately does not do the trick:
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.20706575 -0.9111954
5 2 1/2/2009 600 0.42912469 -0.4906859
8 3 1/2/2005 800 -0.54663186 -0.6937202
11 4 1/1/2011 900 -0.47719270 -1.0236557
14 5 1/1/2008 100 0.06445882 1.1022975
15 6 1/1/2012 100 0.95949406 -0.4755931
Although it does ignore the irrelevant variables (JUNK1, JUNK2) , the technique is too greedy. The new dataset should contain three observations on unit one because there are three unique combinations of UNIT + DATE + OUT1 when UNIT = 1. Is there a way to achieve this without writing a function?
You can pass a data.frame to duplicated
In your case, you want to pass the first 3 columns of test
test2 <- test[!duplicated(test[,1:3]),]
If you are using big data, and want to embrace data.tables, then you can set the key to be the first three columns (which you want to remove the duplicates from) and then use unique
library(data.table)
DT <- data.table(test)
# set the key
setkey(DT, UNIT,DATE,OUT1)
DTU <- unique(DT)
For more details on duplicates and data.tables see Filtering out duplicated/non-unique rows in data.table
Thanks! Looks like we can do:
test2 <- test[!duplicated(test[,c("OUT1","DATE","UNIT")]),]
and it delivers the goods as well. So, we can just use the column names rather than 1:3 and the order doesn't matter
You can use distinct() from the dplyr package:
library(dplyr)
test %>%
distinct(UNIT, DATE, OUT1)
Or without the %>% pipe:
distinct(test, UNIT, DATE, OUT1)
来源:https://stackoverflow.com/questions/15489185/removing-duplicate-units-from-data-frame