How to merge two data frames on common columns in R with sum of others?

丶灬走出姿态 提交于 2019-11-27 12:25:01

You can use ddply in package plyr and combine it with merge:

library(plyr)
ddply(merge(data_A, data_B, all.x=TRUE), 
  .(USER_A, USER_B), summarise, ACTION=sum(ACTION))

Notice that merge is called with the parameter all.x=TRUE - this returns all of the values in the first data.frame passed to merge, i.e. data_A:

  USER_A USER_B ACTION
1      1     11   0.30
2      1     13   0.25
3      1     16   0.63
4      1     17   0.26
5      2     11   0.14
6      2     14   0.28

This sort of thing is quite easy to do with a database-like operation. Here I use package sqldf to do a left (outer) join and then summarise the resulting object:

require(sqldf)
tmp <- sqldf("select * from data_A left join data_B using (USER_A, USER_B)")

This results in:

> tmp
  USER_A USER_B ACTION ACTION
1      1     11   0.30     NA
2      1     13   0.25   0.17
3      1     16   0.63     NA
4      1     17   0.26     NA
5      2     11   0.14   0.25
6      2     14   0.28     NA

Now we just need sum the two ACTION columns:

data_C <- transform(data_A, ACTION = rowSums(tmp[, 3:4], na.rm = TRUE))

Which gives the desired result:

> data_C
  USER_A USER_B ACTION
1      1     11   0.30
2      1     13   0.42
3      1     16   0.63
4      1     17   0.26
5      2     11   0.39
6      2     14   0.28

This can be done using standard R function merge:

> merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
  USER_A USER_B ACTION.x ACTION.y
1      1     11     0.30       NA
2      1     13     0.25     0.17
3      1     16     0.63       NA
4      1     17     0.26       NA
5      2     11     0.14     0.25
6      2     14     0.28       NA

So we can replace the sqldf() call above with:

tmp <- merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)

whilst the second line using transform() remains the same.

I wrote the package safejoin which solves this very succintly :

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
safe_left_join(data_A,data_B, by = c("USER_A", "USER_B"), 
               conflict = ~ .x+ ifelse(is.na(.y),0,.y))
#   USER_A USER_B ACTION
# 1      1     11   0.30
# 2      1     13   0.42
# 3      1     16   0.63
# 4      1     17   0.26
# 5      2     11   0.39
# 6      2     14   0.28

In case of conflict, the function fed to the conflict argument will be used on pairs of conflicting columns

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!