I have two data frames that have some columns with the same names and others with different names. The data frames look something like this:
df1
ID hel
Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.
## Required packages
library(data.table)
library(reshape2)
dcast.data.table(
merge(
## melt the first data.frame and set the key as ID and variable
setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable),
## melt the second data.frame
melt(as.data.table(df2), id.vars = "ID"),
## you'll have 2 value columns...
all = TRUE)[, value := ifelse(
## ... combine them into 1 with ifelse
is.na(value.x), value.y, value.x)],
## This is your reshaping formula
ID ~ variable, value.var = "value")
# ID hello world football baseball hockey soccer
# 1: 1 2 3 43 6 7 4
# 2: 2 5 1 24 32 2 5
# 3: 3 10 8 2 23 8 23
# 4: 4 4 17 5 15 5 12
# 5: 5 9 7 12 23 3 43
Nobody's posted a dplyr solution, so here's a succinct option in dplyr. The approach is simply to do a full_join that combines all rows, then group and summarise to remove the redundant missing cells.
library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df1 %>%
full_join(df2, by = intersect(colnames(df1), colnames(df2))) %>%
group_by(ID) %>%
summarize_all(na.omit)
#> # A tibble: 5 x 7
#> ID hello world hockey soccer football baseball
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 2 3 7 4 43 6
#> 2 2 5 1 2 5 24 32
#> 3 3 10 8 8 23 2 23
#> 4 4 4 17 5 12 5 15
#> 5 5 9 7 3 43 12 2
Created on 2018-07-13 by the reprex package (v0.2.0).
Here's another data.table approach using binary merge:
library(data.table)
setkey(setDT(df1), ID) ; setkey(setDT(df2), ID) # Converting to data.table objects and setting keys
df1 <- df1[df2][, `:=`(i.hello = NULL, i.world = NULL)] # Full left join
df1[df2[complete.cases(df2)], `:=`(hello = i.hello, world = i.world)][] # Joining only on non-missing values
# ID hello world football baseball hockey soccer
# 1: 1 2 3 43 6 7 4
# 2: 2 5 1 24 32 2 5
# 3: 3 10 8 2 23 8 23
# 4: 4 4 17 5 15 5 12
# 5: 5 9 7 12 23 3 43
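The last step above is an update join: rows of the left table that match the right table have their columns overwritten by reference. Here is a minimal sketch on toy data, assuming data.table is installed. The tables a and b are made up for illustration, and on = is used in place of setkey.

```r
library(data.table)

# Toy tables (hypothetical, not the question's data)
a <- data.table(ID = 1:3, hello = c(NA, 5L, NA))
b <- data.table(ID = c(1L, 3L), hello = c(2L, 9L))

# For every row of `a` whose ID matches a row of `b`,
# overwrite a$hello with the value from `b` (referred to as i.hello)
a[b, on = "ID", hello := i.hello]

a$hello  # 2 5 9
```

Row 2 keeps its original value because b has no row with ID 2.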
@ananda-mahto's answer is more elegant, but here is my suggestion:
library(reshape2)
df1=melt(df1,id='ID',na.rm=TRUE)
df2=melt(df2,id='ID',na.rm=TRUE)
DF=rbind(df1,df2)
# Not needed, added na.rm=TRUE based on @ananda-mahto's valid comment
# DF<-DF[!is.na(DF$value),]
dcast(DF,ID~variable,value.var='value')
Using tidyverse we could use coalesce.
None of the solutions below builds extra rows; the data stays roughly the same size and shape throughout the chain.
Solution 1
list(df1,df2) %>%
transpose(union(names(df1),names(df2))) %>%
map_dfc(. %>% compact %>% invoke(coalesce,.))
# # A tibble: 5 x 7
# ID hello world football baseball hockey soccer
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2 3 43 6 7 4
# 2 2 5 1 24 32 2 5
# 3 3 10 8 2 23 8 23
# 4 4 4 17 5 15 5 12
# 5 5 9 7 12 23 3 43
Explanations
list the two data frames together.
transpose it, so each new item at the root has the name of a column of the output. By default, transpose uses the first element as a template, so unfortunately we have to pass the names explicitly to get all of them.
compact these items, as they were all of length 2, but with one of them being NULL when the given column was missing on one side.
coalesce those, which basically means returning the first non-NA value you find when putting the arguments side by side.
If repeating df1 and df2 on the second line is an issue, use the following instead:
transpose(invoke(union, setNames(map(., names), c("x","y"))))
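To make the compact and coalesce steps concrete, here is a small sketch on toy vectors, assuming dplyr and purrr are installed. The vectors x and y are invented for illustration, and do.call stands in for invoke, which does the same job of calling a function on a list of arguments.

```r
library(dplyr)
library(purrr)

# Two versions of the same column, each with gaps
x <- c(NA, 5L, NA)
y <- c(2L, NA, 9L)

# coalesce takes, position by position, the first non-NA value
first_non_na <- coalesce(x, y)   # 2 5 9

# A column present on one side only shows up as list(NULL, y);
# compact drops the NULL so coalesce receives a single argument
pieces <- compact(list(NULL, y))
from_one_side <- do.call(coalesce, pieces)
```

With a single argument, coalesce simply returns that vector, which is why columns that exist in only one data frame pass through unchanged.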
Solution 2
Same philosophy, but this time we loop on names:
map_dfc(set_names(union(names(df1), names(df2))),
~ invoke(coalesce, compact(list(df1[[.x]], df2[[.x]]))))
# # A tibble: 5 x 7
# ID hello world football baseball hockey soccer
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2 3 43 6 7 4
# 2 2 5 1 24 32 2 5
# 3 3 10 8 2 23 8 23
# 4 4 4 17 5 15 5 12
# 5 5 9 7 12 23 3 43
Here it is in piped form, for those who prefer that:
union(names(df1), names(df2)) %>%
set_names %>%
map_dfc(~ list(df1[[.x]], df2[[.x]]) %>%
compact %>%
invoke(coalesce, .))
Explanations
set_names gives a character vector names identical to its values, so map_dfc can name the output's columns correctly.
df1[[.x]] will return NULL when .x is not a column of df1, and we take advantage of this.
df1 and df2 are mentioned 2 times each and I can't think of any way around it.
Solution 1 is cleaner with respect to these points, so I recommend it.
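The two behaviours Solution 2 relies on can be checked in base R alone. This is a sketch on a made-up data frame; stats::setNames(cols, cols) is used here as a base R stand-in for calling set_names on a character vector.

```r
# Toy data frame (hypothetical)
df <- data.frame(ID = 1:3, hello = c(2L, NA, 9L))

# [[ ]] returns NULL, not an error, for a column that does not exist
has_hello  <- !is.null(df[["hello"]])   # TRUE
has_hockey <- !is.null(df[["hockey"]])  # FALSE

# Naming a character vector after its own values
cols <- c("ID", "hello")
named_cols <- stats::setNames(cols, cols)
names(named_cols)  # "ID" "hello"
```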
We could use my package safejoin to do a left join and deal with the conflicts using dplyr::coalesce:
# # devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
safe_left_join(df1, df2, by = "ID", conflict = coalesce)
# # A tibble: 5 x 7
# ID hello world hockey soccer football baseball
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 2 3 7 4 43 6
# 2 2 5 1 2 5 24 32
# 3 3 10 8 8 23 2 23
# 4 4 4 17 5 12 5 15
# 5 5 9 7 3 43 12 2
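For readers who would rather not install an extra package, the same idea can be approximated with dplyr alone: join, then coalesce each pair of conflicting columns by hand. This is a rough sketch on made-up data, not safejoin's implementation, and it handles a single conflicting column (hello) explicitly.

```r
library(dplyr)

# Toy data frames (hypothetical) sharing ID and hello
df1 <- data.frame(ID = 1:3, hello = c(NA, 5L, 10L), hockey = c(7L, 2L, 8L))
df2 <- data.frame(ID = 1:3, hello = c(2L, NA, NA), football = c(43L, 24L, 2L))

# The join keeps both versions of the conflicting column as hello.x / hello.y
joined <- left_join(df1, df2, by = "ID", suffix = c(".x", ".y"))

# Merge the two versions, then drop the suffixed originals
result <- joined %>%
  mutate(hello = coalesce(hello.x, hello.y)) %>%
  select(-hello.x, -hello.y)

result$hello  # 2 5 10
```

With many conflicting columns this gets repetitive, which is the case the conflict argument above is meant to cover.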