Question
I have a dataset composed of various individuals, where each individual has a unique id. Each individual can appear multiple times in the dataset, but my understanding is that, apart from one or two variables (there are about 80 per individual), the values should be the same for every entry with the same id.
I want to collapse the data if I can. My main obstacle is certain null values that I need to backfill. I'm looking for a function that can deduplicate the data, something like this:
# Build sample dataset
df1 = data.frame(id=rep(1:6,2)
,classA=rep(c('a','b'),6)
,classB=rep(c(1001:1006),2)
)
df1= df1[order(df1$id),]
df1$classC=c('a',NA,'b',NA,NA,NA,'e','d', NA, 'f', NA, NA)
df1[10,"classB"]=NA
df1=df1[df1$id!=6,]
#sample dataset
> df1
   id classA classB classC
1   1      a   1001      a
7   1      a   1001   <NA>
2   2      b   1002      b
8   2      b   1002   <NA>
3   3      a   1003   <NA>
9   3      a   1003   <NA>
4   4      b   1004      e
10  4      b   1004      d
5   5      a   1005   <NA>
11  5      a     NA      f
# what I'm looking for
> deduplicate(df1, on='id')
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      d
5  4      b   1004      e
6  5      a   1005      f
Answer 1:
How about this? (A solution using data.table.)
require(data.table)
DT <- data.table(df1)
# ignore the warning here...
unique(DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id])
   id classA classB classC
1:  1      a   1001      a
2:  2      b   1002      b
3:  3      a   1003     NA
4:  4      b   1004      e
5:  4      b   1004      d
6:  5      a   1005      f
Some explanation:
- The by = id part splits/groups your data.table DT by id. .SD is a read-only variable that automatically picks up each split/group for each id, one at a time.
- We therefore split DT by id and, for each split part, use lapply (to take each column) and remove all NAs. Now, if you have, say, (a, NA), then the NA gets removed and it returns a. But the input was of length 2 (a, NA), so it automatically recycles a to fit the size (= 2). So, essentially, we replace each NA with an already existing value from the same group. When both are NA (like NA, NA), nothing is left after the removal and the column is filled back out with NAs.
- If you look at this part, DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id], you should be able to understand what has been done. Every NA that has a non-NA counterpart in its group will have been replaced. So all we need to do is pick up the unique rows, and that's why it's wrapped with unique.
Hope this helps. You'll have to experiment a bit to understand better. I suggest starting here: DT[, print(.SD), by=id]
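To see the per-group step in isolation, here is a minimal base R sketch (no data.table needed). It assumes that rep_len is a fair stand-in for the recycling described above; it is an illustration only, not how data.table works internally, and the variable names are made up.

```r
# One column of one id group, as the function inside lapply(.SD, ...) sees it.
x <- c("a", NA)

kept <- x[!is.na(x)]                   # NAs dropped: one value left
refilled <- rep_len(kept, length(x))   # stretched back to the group size

print(kept)
print(refilled)

# A group holding two distinct values plus one NA leaves 2 values for 3 rows,
# which no longer lines up with the other columns of that group.
y <- c("e", "d", NA)
n_kept <- length(y[!is.na(y)])
print(n_kept)
```

The second case is the one to watch for: once a group keeps more than one value but fewer than its number of rows, plain recycling no longer fits.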
Final solution:
I just realised that the above solution will not work if you've got, for example, another row for id=4 with classC = NA (and everything else the same). This happens because of a recycling issue. This code should fix it:
unique(DT[, lapply(.SD, function(x) {x[is.na(x)] <- x[!is.na(x)][1]; x}), by = id])
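For readers without data.table, the same idea (fill each NA with the first non-NA value in its id group, then keep unique rows) might be sketched in base R as follows. The names fill_first and deduplicate and the small data frame are hypothetical, chosen to mirror the question; this is a sketch under those assumptions rather than a tested general-purpose function.

```r
# Hypothetical helper: replace each NA with the first non-NA value, if any.
fill_first <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) x[is.na(x)] <- known[1]
  x
}

# Hypothetical deduplicate(): backfill within each id, then drop repeated rows.
deduplicate <- function(df, on) {
  pieces <- lapply(split(df, df[[on]]), function(g) {
    g[] <- lapply(g, fill_first)   # fill column by column, keep the data.frame
    unique(g)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Made-up data: id 4 has two distinct classC values plus an NA row.
df <- data.frame(
  id     = c(1, 1, 4, 4, 4, 5, 5),
  classB = c(1001, 1001, 1004, 1004, 1004, 1005, NA),
  classC = c("a", NA, "e", "d", NA, NA, "f"),
  stringsAsFactors = FALSE
)
res <- deduplicate(df, on = "id")
print(res)
```

Note that when a group holds several distinct values, as id 4 does here, the NA row is filled with whichever value comes first and then disappears as a duplicate of that row.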
Answer 2:
I would first check whether there are rows with a duplicated id and a missing classC, and remove them like this:
dd <- df1[duplicated(df1[,1]) & is.na(df1$classC), ]
df1[setdiff(rownames(df1), rownames(dd)), ]
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      e
8  4      b   1004      d
EDIT
I think that, to generalize the above to many columns, one idea is to put your data in long format using melt, for example:
library(reshape2)
dat.m <- melt(df1,id.vars='id')
dd <- dat.m[order(dat.m$id),]
rr <- dd[duplicated(dd$id) & is.na(dd$value),]
kk <- dd[setdiff(rownames(dd), rownames(rr)), ]
kk <- kk[!duplicated(kk),]
dcast(kk,id~variable,drop=FALSE,fun.aggregate=list,fill=list(NA))
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003     NA
4  4      b   1004   e, d
5  5      a   1005      f
The final result is slightly different from your desired output, but you can get it with a little work (strsplit for example).
Source: https://stackoverflow.com/questions/17266578/deduplicating-collapsing-records-in-an-r-dataframe