Deduplicating/collapsing records in an R dataframe

Submitted by 懵懂的女人 on 2019-12-21 20:43:28

Question


I have a dataset made up of various individuals, where each individual has a unique id. Each individual can appear multiple times in the dataset, but as I understand it, apart from one or two variables (there are about 80 per individual), the values should be the same in every entry for a given id.

I want to collapse the data if I can. My main obstacle is that certain null (NA) values need to be backfilled first. I'm looking for a function that deduplicates along these lines:

# Build sample dataset
df1 <- data.frame(id = rep(1:6, 2),
                  classA = rep(c('a', 'b'), 6),
                  classB = rep(1001:1006, 2))
df1 <- df1[order(df1$id), ]
df1$classC <- c('a', NA, 'b', NA, NA, NA, 'e', 'd', NA, 'f', NA, NA)
df1[10, "classB"] <- NA
df1 <- df1[df1$id != 6, ]

#sample dataset
> df1
   id classA classB classC
1   1      a   1001      a
7   1      a   1001   <NA>
2   2      b   1002      b
8   2      b   1002   <NA>
3   3      a   1003   <NA>
9   3      a   1003   <NA>
4   4      b   1004      e
10  4      b   1004      d
5   5      a   1005   <NA>
11  5      a     NA      f        

# what I'm looking for
> deduplicate(df1, on='id')
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      d
5  4      b   1004      e
6  5      a   1005      f     

Answer 1:


How about this? (solution using data.table)

library(data.table)
DT <- data.table(df1)
# ignore the warning here...
unique(DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id])

   id classA classB classC
1:  1      a   1001      a
2:  2      b   1002      b
3:  3      a   1003     NA
4:  4      b   1004      e
5:  4      b   1004      d
6:  5      a   1005      f

Some explanation:

  • the by = id part splits/groups your data.table DT by id.
  • .SD is a read-only variable that automatically picks up each split/group for each id one at a time.
  • we therefore split DT by id and, within each group, use lapply to go over every column and drop its NAs. If a column holds, say, (a, NA), the NA is removed and a is returned. The input had length 2, though, so a is recycled to fill the group (size 2). In effect, every NA is replaced with a value that already exists in the group. When both values are NA (NA, NA), NAs are returned (again through recycling).
  • If you look at DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id] on its own, you should be able to see what has been done: every NA that could be filled has been replaced. All that remains is to pick the unique rows, which is why the call is wrapped in unique.
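
To see the per-column step in isolation, here is a small sketch on plain vectors. The helper name drop_na is hypothetical and introduced only for illustration; its body is the anonymous function from the call above. It shows what comes back for the three kinds of groups in the sample data, before data.table does any recycling.

```r
# Hypothetical helper: same body as the anonymous function in the answer
drop_na <- function(x) x[!is.na(x)]

# One value survives; data.table then recycles it to the group size
drop_na(c("a", NA))                        # "a"

# Nothing to remove; both values are kept
drop_na(c("e", "d"))                       # "e" "d"

# Everything is removed, leaving a zero-length vector; data.table pads
# this back out with NA (presumably the source of the warning)
drop_na(c(NA_character_, NA_character_))   # character(0)
```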

Hope this helps. You'll have to experiment a bit to understand better. I suggest starting here: DT[, print(.SD), by=id]


Final solution:

I just realised that the above solution will not work if, for example, id=4 has another row with classC = NA (and everything else the same). This is a recycling problem: the group would have three rows but only two non-NA values. This code should fix it:

unique(DT[, lapply(.SD, function(x) {x[is.na(x)] <- x[!is.na(x)][1]; x}), by = id])
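
As a rough illustration of why this version avoids the recycling problem, the sketch below pulls the fill step out into a named function and wraps it in a base-R deduplicate(df, on) with the interface asked for in the question. Both names (fill_na, deduplicate) are hypothetical, and the wrapper uses split/lapply/rbind rather than data.table, so treat it as one possible equivalent under those assumptions, not as the answer's own code.

```r
# Hypothetical helper name; the body is the function from the final solution
fill_na <- function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]  # first non-NA value, or NA if there is none
  x
}

fill_na(c("e", "d", NA))  # "e" "d" "e": the length is preserved
fill_na(c(NA, NA))        # NA NA: nothing to fill from

# Hypothetical base-R wrapper around the same idea
deduplicate <- function(df, on) {
  groups <- split(df, df[[on]])            # one data frame per id
  filled <- lapply(groups, function(g) {
    g[] <- lapply(g, fill_na)              # fill NAs column by column
    g
  })
  out <- unique(do.call(rbind, filled))    # drop rows that are now identical
  rownames(out) <- NULL
  out
}
```

With the sample df1 from the question, deduplicate(df1, on = "id") should return six rows: one each for ids 1, 2, 3 and 5, and two for id 4 (classC e and d). Because the filled vector always has the same length as the group, nothing needs to be recycled.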



Answer 2:


I would first find the rows that have a duplicated id and a missing classC, and remove them like this:

dd <- df1[duplicated(df1[,1]) & is.na(df1$classC), ]
df1[setdiff(rownames(df1), rownames(dd)), ]
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      e
8  4      b   1004      d

EDIT

To generalize the above to many columns, one idea is to put your data in long format, using melt for example:

library(reshape2)
dat.m  <- melt(df1,id.vars='id')
dd <- dat.m[order(dat.m$id),]
rr <- dd[duplicated(dd$id) & is.na(dd$value),]
kk <- dd[setdiff(rownames(dd), rownames(rr)), ]
kk <- kk[!duplicated(kk),]
dcast(kk,id~variable,drop=FALSE,fun.aggregate=list,fill=list(NA))
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003     NA
4  4      b   1004   e, d
5  5      a   1005      f

The final result is slightly different from your desired output, but you can get it with a little work (strsplit for example).
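
One way that remaining step might look is sketched below. It assumes the collapsed column is a list-column (which is what fun.aggregate=list appears to produce) and uses rep plus unlist instead of strsplit; the small res data frame is a hypothetical stand-in for the dcast result, not its actual output.

```r
# Hypothetical stand-in for the dcast() result: classC is a list-column
res <- data.frame(id = c(4, 5), classA = c("b", "a"), stringsAsFactors = FALSE)
res$classC <- list(c("e", "d"), "f")

# Repeat each row once per element of the list-column, then flatten it
n <- lengths(res$classC)
expanded <- res[rep(seq_len(nrow(res)), n), c("id", "classA")]
expanded$classC <- unlist(res$classC)
rownames(expanded) <- NULL
expanded
```

Here id 4 ends up on two rows (classC e and d) and id 5 on one, which is the shape requested in the question.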



Source: https://stackoverflow.com/questions/17266578/deduplicating-collapsing-records-in-an-r-dataframe
