Deduplicating/collapsing records in an R dataframe

Question

I have a dataset made up of individuals, each with a unique id. An individual can appear multiple times in the dataset, but as I understand it, apart from one or two variables (there are about 80 per individual), the values should be the same across all entries for a given id.

I want to collapse the data if I can. My main obstacle is that certain null values need to be backfilled first. I'm looking for a function that deduplicates along these lines:

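# Build sample dataset
df1 <- data.frame(id = rep(1:6, 2),
                  classA = rep(c('a', 'b'), 6),
                  classB = rep(1001:1006, 2))
df1 <- df1[order(df1$id), ]      # put the two rows for each id next to each other
df1$classC <- c('a', NA, 'b', NA, NA, NA, 'e', 'd', NA, 'f', NA, NA)
df1[10, "classB"] <- NA          # 10th row by position (row name 11, id 5)
df1 <- df1[df1$id != 6, ]        # drop id 6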

#sample dataset
> df1
   id classA classB classC
1   1      a   1001      a
7   1      a   1001   <NA>
2   2      b   1002      b
8   2      b   1002   <NA>
3   3      a   1003   <NA>
9   3      a   1003   <NA>
4   4      b   1004      e
10  4      b   1004      d
5   5      a   1005   <NA>
11  5      a     NA      f        

# what I'm looking for
> deduplicate(df1, on='id')
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      d
5  4      b   1004      e
6  5      a   1005      f

Answer 1:

How about this? (solution using data.table)

require(data.table)
DT <- data.table(df1)
# ignore the warning here...
unique(DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id])

   id classA classB classC
1:  1      a   1001      a
2:  2      b   1002      b
3:  3      a   1003     NA
4:  4      b   1004      e
5:  4      b   1004      d
6:  5      a   1005      f

Some explanation:

The by = id part groups the data.table DT by id.
.SD is a read-only variable that holds the subset of the data for each group, one group at a time.
So DT is split by id and, within each group, lapply visits each column and removes its NAs. If a column holds, say, (a, NA), the NA is dropped and a is returned. The group has two rows, though, so a is recycled to length 2. In effect, every NA is replaced by a value that already exists in the group. When every value is NA (like NA, NA), nothing is left after filtering and the column comes back as NA for that group.
If you run DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id] on its own, you can see the result: every NA that could be replaced has been. All that remains is to pick the unique rows, which is why the call is wrapped in unique.
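
To make the filtering step concrete, here is a minimal base-R sketch of what the anonymous function returns for three kinds of group. The name drop_na is made up for illustration; only the function body comes from the answer, and no data.table is needed to run it.

```r
# Same body as the anonymous function in the data.table call;
# the name drop_na is hypothetical.
drop_na <- function(x) x[!is.na(x)]

one_value  <- drop_na(c("a", NA))                       # "a": one value for a two-row group
no_value   <- drop_na(c(NA_character_, NA_character_))  # character(0): nothing survives
two_values <- drop_na(c("e", "d"))                      # unchanged: both values are kept

# Lengths of the three results: 1, 0 and 2
print(lengths(list(one_value, no_value, two_values)))
```

Only the first two cases are shorter than the group, so those are the ones that depend on data.table padding the column back to the group size.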

Hope this helps. You'll have to experiment a bit to understand better. I suggest starting here: DT[, print(.SD), by=id]

Final solution:

I just realised that the above solution will not work if, for example, id=4 has a third row with classC = NA (and everything else the same). This is a recycling issue: two non-NA values do not fit evenly into three rows. This code should fix it:

unique(DT[, lapply(.SD, function(x) {x[is.na(x)] <- x[!is.na(x)][1]; x}), by = id])
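
For readers without data.table, the same idea (backfill each column within an id, then keep unique rows) can be sketched in base R. This is not the answer's code: the helper names fill_first and deduplicate are hypothetical (deduplicate borrows the name from the question), and the sample data is typed out by hand to match the printed df1.

```r
# Sample data as printed in the question, entered directly (assumption).
df1 <- data.frame(
  id     = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  classA = c("a", "a", "b", "b", "a", "a", "b", "b", "a", "a"),
  classB = c(1001, 1001, 1002, 1002, 1003, 1003, 1004, 1004, 1005, NA),
  classC = c("a", NA, "b", NA, NA, NA, "e", "d", NA, "f"),
  stringsAsFactors = FALSE
)

# Replace each NA with the first non-NA value of the vector.
# If every value is NA, the vector is returned unchanged.
fill_first <- function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
}

# Hypothetical helper: backfill within each group of `on`, then drop duplicate rows.
deduplicate <- function(df, on) {
  value_cols <- setdiff(names(df), on)
  filled <- lapply(split(df, df[[on]]), function(g) {
    g[value_cols] <- lapply(g[value_cols], fill_first)
    g
  })
  out <- unique(do.call(rbind, filled))
  rownames(out) <- NULL
  out
}

result <- deduplicate(df1, on = "id")
print(result)

# The three-row case that broke the first version: length is preserved.
three_rows <- fill_first(c("e", "d", NA))   # "e" "d" "e"
```

Because fill_first always returns a vector as long as its input, no recycling is involved. One caveat: NAs are filled with the first non-NA value of the group, so a group with conflicting values plus an NA collapses to one row per distinct value.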

Answer 2:

I would first find the rows that have a duplicated id and a missing classC, and remove them like this:

dd <- df1[duplicated(df1[,1]) & is.na(df1$classC), ]
df1[setdiff(rownames(df1), rownames(dd)), ]
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      e
8  4      b   1004      d
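
As a hedged check, here is a base-R sketch of the same filter written with a logical mask instead of setdiff on row names (a simplification, not the author's code). It runs on the question's sample data typed out by hand, which is an assumption: the output above may have been produced from different data.

```r
# Sample data as printed in the question, entered directly (assumption).
df1 <- data.frame(
  id     = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  classA = c("a", "a", "b", "b", "a", "a", "b", "b", "a", "a"),
  classB = c(1001, 1001, 1002, 1002, 1003, 1003, 1004, 1004, 1005, NA),
  classC = c("a", NA, "b", NA, NA, NA, "e", "d", NA, "f"),
  stringsAsFactors = FALSE
)

# Drop a row only if its id was already seen AND its classC is missing.
drop_row <- duplicated(df1$id) & is.na(df1$classC)
kept <- df1[!drop_row, ]
print(kept)
```

On this data the filter only removes rows and never backfills, so id 5 still occupies two rows: the first is missing classC and the second is missing classB.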

EDIT

To generalize the above to many columns, one idea is to put your data in long format, using melt for example:

library(reshape2)
dat.m  <- melt(df1,id.vars='id')
dd <- dat.m[order(dat.m$id),]
rr <- dd[duplicated(dd$id) & is.na(dd$value),]
kk <- dd[setdiff(rownames(dd), rownames(rr)), ]
kk <- kk[!duplicated(kk),]
dcast(kk,id~variable,drop=FALSE,fun.aggregate=list,fill=list(NA))
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003     NA
4  4      b   1004   e, d
5  5      a   1005      f

The final result is slightly different from your desired output, but you can get it with a little work (strsplit for example).
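
One possible way to do that last step, sketched in base R with rep and unlist in place of strsplit. It assumes the collapsed column is a list column, as the printed "e, d" suggests; the small wide table below is built by hand to mimic that shape and is not the actual dcast output.

```r
# Hand-built stand-in for the wide result (assumption): classC is a list column.
wide <- data.frame(id = c(3, 4, 5), classA = c("a", "b", "a"),
                   stringsAsFactors = FALSE)
wide$classC <- list(NA_character_, c("e", "d"), "f")

# Repeat each row once per value in its list cell, then flatten the cells.
n_values <- lengths(wide$classC)
long <- wide[rep(seq_len(nrow(wide)), n_values), c("id", "classA")]
long$classC <- unlist(wide$classC)
rownames(long) <- NULL
print(long)
```

Here id 4 expands into two rows (e and d), while ids 3 and 5 keep a single row each. With several list columns of differing lengths per row, this simple expansion would need more care.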

Source: https://stackoverflow.com/questions/17266578/deduplicating-collapsing-records-in-an-r-dataframe

Tags

join

merge

dataframe