Deduplicating/collapsing records in an R dataframe

Submitted by ☆樱花仙子☆ on 2019-12-04 12:58:52

How about this? (solution using data.table)

library(data.table)
DT <- data.table(df1)
# within each id group, drop the NAs from every column, then keep distinct rows
# ignore the warning here...
unique(DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id])

   id classA classB classC
1:  1      a   1001      a
2:  2      b   1002      b
3:  3      a   1003     NA
4:  4      b   1004      e
5:  4      b   1004      d
6:  5      a   1005      f

Some explanation:

  • the by = id part splits/groups your data.table DT by id.
  • .SD is a read-only variable that automatically picks up each split/group for each id one at a time.
  • we therefore split DT by id and, for each group, use lapply to take each column and remove its NAs. If a column holds, say, (a, NA), the NA is removed and a is returned. But the input had length 2, so a is automatically recycled to fit that size (= 2). In effect, every NA is replaced with a value that already exists in the same group. When all values are NA (like NA, NA), NAs are returned (again through recycling).
  • If you run the inner part, DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id], on its own, you should be able to see what has been done: every NA has been replaced. All that is left is to pick the unique rows, which is why the call is wrapped in unique.
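
To make the per-group step above concrete, here is a minimal base R sketch of what "take each column of each group and drop its NAs" produces. The df1 below is made up for illustration (the original input is not shown in this thread), and split/lapply stand in for data.table's by = id and .SD.

```r
# Hypothetical input; the real df1 is not shown in this thread.
df1 <- data.frame(
  id     = c(1, 1, 2, 3, 4, 4),
  classA = c("a", NA, "b", "a", "b", "b"),
  classC = c(NA, "a", "b", NA, "e", "d"),
  stringsAsFactors = FALSE
)

# One data frame per id, roughly what .SD sees for each group
# (except that .SD leaves out the grouping column).
groups <- split(df1, df1$id)

# For every group, drop the NAs from each non-id column.
non_na <- lapply(groups, function(g) {
  lapply(g[-1], function(x) x[!is.na(x)])
})

non_na[["1"]]$classA          # one value left from a group of two rows
length(non_na[["3"]]$classC)  # nothing left when every value is NA
```

The results are often shorter than the group they came from, which is why the answer relies on recycling to get back to full-length columns.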

Hope this helps. You'll have to experiment a bit to understand better. I suggest starting here: DT[, print(.SD), by=id]


Final solution:

I just realised that the above solution will not work if, for example, id=4 has another row with classC = NA (and everything else the same). This is caused by the recycling. This code should fix it:

unique(DT[, lapply(.SD, function(x) {x[is.na(x)] <- x[!is.na(x)][1]; x}), by = id])
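
The anonymous function in that call can be tried on plain vectors to see what it does. The name fill_na_first is only a label chosen for this sketch; it is not part of data.table.

```r
# Same body as the function passed to lapply above:
# every NA is overwritten with the first non-NA value of the vector.
fill_na_first <- function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
}

filled <- fill_na_first(c("e", NA, "d", NA))
filled   # "e" "e" "d" "e": length is preserved, so no recycling is needed

all_na <- fill_na_first(c(NA, NA))
all_na   # NA NA: with no non-NA value available, the NAs stay
```

Because the result always has the same length as the input, each group keeps its original number of rows, and unique then removes the rows that have become identical.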

I would first check whether there are rows with a duplicated id and a missing classC, and remove them like this:

dd <- df1[duplicated(df1[,1]) & is.na(df1$classC), ]
df1[setdiff(rownames(df1), rownames(dd)), ]
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      e
8  4      b   1004      d
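
As a rough illustration of which rows the duplicated/is.na test picks, here is a small sketch on made-up data (the column values are assumptions, not the original df1):

```r
# Hypothetical input with one NA in a single-row id and one in a repeated id.
df1 <- data.frame(
  id     = c(1, 2, 4, 4, 4),
  classC = c("a", NA, "e", NA, "d"),
  stringsAsFactors = FALSE
)

# TRUE only for rows that repeat an earlier id AND have a missing classC.
drop <- duplicated(df1$id) & is.na(df1$classC)

kept <- df1[!drop, ]
kept
```

Row 2 survives even though its classC is NA, because it is the only row for its id. Note that duplicated() marks only the second and later occurrences, so an NA sitting in the first row of an id would not be dropped by this test; sorting so that NAs come last within each id is one way around that.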

EDIT

To generalize the above to many columns, one idea is to put your data in long format, using melt for example:

library(reshape2)
dat.m  <- melt(df1,id.vars='id')
dd <- dat.m[order(dat.m$id),]
rr <- dd[duplicated(dd$id) & is.na(dd$value),]
kk <- dd[setdiff(rownames(dd), rownames(rr)), ]
kk <- kk[!duplicated(kk),]
dcast(kk,id~variable,drop=FALSE,fun.aggregate=list,fill=list(NA))
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003     NA
4  4      b   1004   e, d
5  5      a   1005      f

The final result is slightly different from your desired output, but you can get it with a little work (strsplit for example).
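
One possible way to do that extra work, sketched in base R: since fun.aggregate = list leaves classC as a list column, the sketch below expands the list with rep and unlist rather than strsplit. The wide data frame is a hand-built stand-in for two rows of the dcast result, not its actual output.

```r
# Hypothetical stand-in for part of the dcast result: classC is a list column.
wide <- data.frame(id = c(3, 4))
wide$classC <- list(NA_character_, c("e", "d"))

# Repeat each id once per element of its list entry, then flatten the list.
n <- lengths(wide$classC)
long <- data.frame(
  id     = rep(wide$id, n),
  classC = unlist(wide$classC),
  stringsAsFactors = FALSE
)
long
```

This gives one row per (id, classC) pair, so id 4 appears twice, once with e and once with d. Other columns would need to be repeated with the same rep(..., n) pattern.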
