How to merge two data frames on common columns in R with sum of others?

暗喜 2020-12-02 23:02

R Version 2.11.1 32-bit on Windows 7

I have two data sets, data_A and data_B:

data_A

USER_A USER_B ACTION
1      11     0.3
1      13     0.         


        
3 Answers
  • 2020-12-02 23:44

    I wrote the package safejoin, which solves this very succinctly:

    # devtools::install_github("moodymudskipper/safejoin")
    library(safejoin)
    safe_left_join(data_A,data_B, by = c("USER_A", "USER_B"), 
                   conflict = ~ .x+ ifelse(is.na(.y),0,.y))
    #   USER_A USER_B ACTION
    # 1      1     11   0.30
    # 2      1     13   0.42
    # 3      1     16   0.63
    # 4      1     17   0.26
    # 5      2     11   0.39
    # 6      2     14   0.28
    

    When both data frames contain a column with the same name that is not a join key (here ACTION), the function passed to the conflict argument is applied to each pair of conflicting columns.
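    To see what the conflict formula computes, here is a minimal base R sketch of the same expression applied to plain vectors. It assumes that .x and .y stand for the ACTION column of the left and right table respectively; the values are the ACTION pairs as they appear in the join output elsewhere in this thread, and the names x, y and combined are only illustrative.

    ```r
    # ACTION values of the six joined rows (left table = data_A).
    x <- c(0.30, 0.25, 0.63, 0.26, 0.14, 0.28)
    # Matching ACTION values from data_B; NA where data_B has no matching row.
    y <- c(NA, 0.17, NA, NA, 0.25, NA)

    # Same rule as the conflict formula: treat a missing right-hand value as 0.
    combined <- x + ifelse(is.na(y), 0, y)
    print(combined)
    ```

    Replacing NA with 0 before adding is what keeps the unmatched rows at their original ACTION value instead of turning them into NA.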

  • 2020-12-02 23:58

    This sort of thing is quite easy to do with a database-like operation. Here I use package sqldf to do a left (outer) join and then summarise the resulting object:

    require(sqldf)
    tmp <- sqldf("select * from data_A left join data_B using (USER_A, USER_B)")
    

    This results in:

    > tmp
      USER_A USER_B ACTION ACTION
    1      1     11   0.30     NA
    2      1     13   0.25   0.17
    3      1     16   0.63     NA
    4      1     17   0.26     NA
    5      2     11   0.14   0.25
    6      2     14   0.28     NA
    

    Now we just need to sum the two ACTION columns:

    data_C <- transform(data_A, ACTION = rowSums(tmp[, 3:4], na.rm = TRUE))
    

    Which gives the desired result:

    > data_C
      USER_A USER_B ACTION
    1      1     11   0.30
    2      1     13   0.42
    3      1     16   0.63
    4      1     17   0.26
    5      2     11   0.39
    6      2     14   0.28
    

    The join itself can also be done with the standard R function merge:

    > merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
      USER_A USER_B ACTION.x ACTION.y
    1      1     11     0.30       NA
    2      1     13     0.25     0.17
    3      1     16     0.63       NA
    4      1     17     0.26       NA
    5      2     11     0.14     0.25
    6      2     14     0.28       NA
    

    So we can replace the sqldf() call above with:

    tmp <- merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
    

    whilst the second line using transform() remains the same.
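    Put together, the base R route can be sketched as a self-contained script. The data_A values below are reconstructed from the outputs shown above; data_B is an assumption that contains only the two rows visible in the join result, since the full data set is not shown. The sketch builds data_C from tmp rather than from data_A, so it does not rely on merge returning rows in the same order as data_A.

    ```r
    # data_A as it appears in the join output above.
    data_A <- data.frame(
      USER_A = c(1, 1, 1, 1, 2, 2),
      USER_B = c(11, 13, 16, 17, 11, 14),
      ACTION = c(0.30, 0.25, 0.63, 0.26, 0.14, 0.28)
    )
    # Assumption: only the two data_B rows that show up in the join output.
    data_B <- data.frame(
      USER_A = c(1, 2),
      USER_B = c(13, 11),
      ACTION = c(0.17, 0.25)
    )

    # Left join on the two key columns; the clashing ACTION columns
    # come back as ACTION.x (from data_A) and ACTION.y (from data_B).
    tmp <- merge(data_A, data_B, by = c("USER_A", "USER_B"), all.x = TRUE)

    # Sum the two ACTION columns row by row, ignoring the NAs of unmatched rows.
    data_C <- data.frame(
      tmp[c("USER_A", "USER_B")],
      ACTION = rowSums(tmp[c("ACTION.x", "ACTION.y")], na.rm = TRUE)
    )
    print(data_C)
    ```

    Note that merge sorts its result by the key columns by default, which is why taking the keys from tmp is the safer choice when data_A is not already sorted.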

  • 2020-12-03 00:07

    You can use ddply in package plyr and combine it with merge:

    library(plyr)
    ddply(merge(data_A, data_B, all.x=TRUE), 
      .(USER_A, USER_B), summarise, ACTION=sum(ACTION))
    

    Note that merge is called with all.x = TRUE, which keeps every row of the first data frame passed to merge, i.e. data_A (with no by argument, merge joins on every column name the two data frames share):

      USER_A USER_B ACTION
    1      1     11   0.30
    2      1     13   0.25
    3      1     16   0.63
    4      1     17   0.26
    5      2     11   0.14
    6      2     14   0.28
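
    For comparison, a group-and-sum variant that keys only on USER_A and USER_B can be sketched in base R, with aggregate standing in for ddply. This is an illustration under assumptions, not the code from the answer above: data_A is reconstructed from the outputs in this thread, data_B holds only the two rows visible in the join results, and keys, matched_B and stacked are hypothetical helper names.

    ```r
    # data_A as it appears in the outputs in this thread.
    data_A <- data.frame(
      USER_A = c(1, 1, 1, 1, 2, 2),
      USER_B = c(11, 13, 16, 17, 11, 14),
      ACTION = c(0.30, 0.25, 0.63, 0.26, 0.14, 0.28)
    )
    # Assumption: only the two data_B rows that show up in the join results.
    data_B <- data.frame(
      USER_A = c(1, 2),
      USER_B = c(13, 11),
      ACTION = c(0.17, 0.25)
    )

    keys <- c("USER_A", "USER_B")
    # Keep only the data_B rows whose key pair also occurs in data_A.
    matched_B <- merge(data_B, unique(data_A[keys]), by = keys)
    # Stack both sources, then sum ACTION within each (USER_A, USER_B) pair.
    stacked <- rbind(data_A, matched_B)
    data_C <- aggregate(ACTION ~ USER_A + USER_B, data = stacked, FUN = sum)
    # aggregate does not order by USER_A first, so sort explicitly.
    data_C <- data_C[order(data_C$USER_A, data_C$USER_B), ]
    rownames(data_C) <- NULL
    print(data_C)
    ```

    Restricting data_B to the keys present in data_A mirrors the left-join behaviour: pairs that exist only in data_B never enter the sum.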
    