Double left join in dplyr to recover values

微笑、不失礼 提交于 2021-02-16 14:29:40

问题


I've checked this issue but couldn't find a matching entry.

Say you have 2 DFs:

df1:mode   df2:sex
1           1
2           2
3

And a DF3 where most of the combinations are not present, e.g.

mode | sex  | cases      
1        1      9
1        1      2
2        2      7
3        1      2
1        2      5

and you want to summarise it with dplyr obtaining all combinations (with not existent ones=0):

  mode | sex  | cases      
    1        1     11
    1        2     5
    2        1     0
    2        2     7
    3        1     2
    3        2     0    

If you do a single left_join (left_join(df1,df3) you recover the modes not in df3, but 'Sex' appears as 'NA', and the same if you do left_join(df2,df3).

So how can you do both left join to recover all absent combinations, with cases=0? dplyr preferred, but sqldf an option.

Thanks in advance, p.


回答1:


First here's you data in a more friendly, reproducible format

df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))

I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then i left join that to the data and replace NA values with zero.

mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))

which gives

  mode sex cases
1    1   1    11
2    1   2     5
3    2   1     0
4    2   2     7
5    3   1     2
6    3   2     0



回答2:


The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.

The help page says:

This is a wrapper around expand(), left_join() and replace_na that's useful for completing missing combinations of data. It turns implicitly missing values into explicitly missing values.

To add the missing combinations of df3 and fill with 0 values instead, you would do:

library(tidyr)
library(dplyr)

df3 %>% complete(mode, sex, fill = list(cases = 0))

  mode sex cases
1    1   1     9
2    1   1     2
3    1   2     5
4    2   1     0
5    2   2     7
6    3   1     2
7    3   2     0

You would still need to group_by and summarise to get the final output you want.

df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
    group_by(mode, sex) %>%
    summarise(cases = sum(cases))

Source: local data frame [6 x 3]
Groups: mode

  mode sex cases
1    1   1    11
2    1   2     5
3    2   1     0
4    2   2     7
5    3   1     2
6    3   2     0


来源:https://stackoverflow.com/questions/30807855/double-left-join-in-dplyr-to-recover-values

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!