Merge 2 dataframes by matching dates

末鹿安然 提交于 2020-01-22 10:00:58

问题


I have two dataframes:

id      dates
MUM-1  2015-07-10
MUM-1  2015-07-11
MUM-1  2015-07-12
MUM-2  2014-01-14
MUM-2  2014-01-15
MUM-2  2014-01-16
MUM-2  2014-01-17

and:

id      dates      field1  field2
MUM-1  2015-07-10     1       0
MUM-1  2015-07-12     2       1
MUM-2  2014-01-14     4       3
MUM-2  2014-01-17     0       1

merged data:

id      dates        field1   field2
MUM-1  2015-07-10      1         0
MUM-1  2015-07-11      na        na
MUM-1  2015-07-12      2         1
MUM-2  2014-01-14      4         3
MUM-2  2014-01-15      na        na
MUM-2  2014-01-16      na        na
MUM-2  2014-01-17      0         1   

code: merge(x= df1, y= df2, by= 'id', all.x= T)

I am using merge but since the size of both dataframes are too huge, it is taking too long to process. Is there any alternative to the merge function? Maybe in dplyr? So that it processes fast in comparision. Both dataframes have more than 900K rows.


回答1:


Instead of using merge with data.table, you can also simply join as follows:

setDT(df1)
setDT(df2)

df2[df1, on = c('id','dates')]

this gives:

> df2[df1]
      id      dates field1 field2
1: MUM-1 2015-07-10      1      0
2: MUM-1 2015-07-11     NA     NA
3: MUM-1 2015-07-12      2      1
4: MUM-2 2014-01-14      4      3
5: MUM-2 2014-01-15     NA     NA
6: MUM-2 2014-01-16     NA     NA
7: MUM-2 2014-01-17      0      1

Doing this with dplyr:

library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))

As mentioned by @Arun in the comments, a benchmark is not very meaningfull on a small dataset with seven rows. So lets create some bigger datasets:

dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
                  field1=sample(c(0,1,2,3,4), size=730, replace = TRUE),
                  field2=sample(c(0,1,2,3,4), size=730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]

As can be seen, @Arun's approach is slightly faster:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          jaap = dt2[dt1, on = c('id','dates')],
          pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
          dplr = left_join(dt1, dt2, by=c("id","dates")),
          arun = dt1[dt2, c("fiedl1", "field2") := .(field1, field2), on=c("id", "dates")])

  test elapsed relative
4 arun   0.015    1.000
1 jaap   0.016    1.067
3 dplr   0.037    2.467
2 pavo   1.033   68.867

For a comparison on a large dataset, see the answer of @Arun.




回答2:


I'd update df1 directly by reference as follows:

require(data.table) # v1.9.5+
setDT(df1)[df2, c("fiedl1", "field2") := 
                .(field1, field2), on=c("id", "dates")]

> df1
#       id      dates fiedl1 field2
# 1: MUM-1 2015-07-10      1      0
# 2: MUM-1 2015-07-11     NA     NA
# 3: MUM-1 2015-07-12      2      1
# 4: MUM-2 2014-01-14      4      3
# 5: MUM-2 2014-01-15     NA     NA
# 6: MUM-2 2014-01-16     NA     NA
# 7: MUM-2 2014-01-17      0      1

This'd be very memory efficient (and faster), as it doesn't copy the entire object just to add two columns, rather updates in place.


Updated with a slightly bigger dataset than @Jaap's updated benchmark:

set.seed(1L)
dt1 = CJ(id1 = paste("MUM", 1:1e4, sep = "-"), id2 = sample(1e3L))
dt2 = dt1[sample(nrow(dt1), 1e5L)][, c("field1", "field2") := lapply(c(1e3L, 1e4L), sample, 1e5L, TRUE)][]

# @Jaap's answers
system.time(ans1 <- setDT(dt2)[dt1, on = c('id1','id2')])
#    user  system elapsed 
#   0.209   0.067   0.277 
system.time(ans2 <- left_join(setDF(dt1), setDF(dt2), by = c("id1", "id2")))
#    user  system elapsed 
# 119.911   0.530 120.749 

# this answer    
system.time(ans3 <- setDT(dt1)[dt2, c("field1", "field2") := list(field1, field2), on = c("id1", "id2")])
#    user  system elapsed 
#   0.087   0.013   0.100 

sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.10.4 (Yosemite)

# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.5 dplyr_0.4.2     

# loaded via a namespace (and not attached):
# [1] magrittr_1.5   R6_2.1.0       assertthat_0.1 parallel_3.2.1 DBI_0.3.1     
# [6] tools_3.2.1    Rcpp_0.12.0    chron_2.3-45  

Dint expect dplyr to be ~1200x slower though.




回答3:


You can convert both data frames to data tables, and then perform a merge:

library(data.table)
setDT(df1); setDT(df2)

merge(df1, df2, by = "id", allow.cartesian = TRUE)

the allow.cartesian part allows a merge when there are repeated values in the key of any of the merged elements (allowing for a new table length greater than the max of the orignal elements, see ?data.table.




回答4:


I think the quickest solution at the moment for those cases (huge datasets) is the data.table merging after setting the keys first.

You can also use dplyr's left_join with data.frames, as mentioned before, but it would be good to compare the same command after transforming your data.frames to data.tables. In other words, use dplyr with a data.table structure in the background.

As an example I'll create two datasets, then save them as a data.frame, data.table with a key and data.table without a key. Then I'll perform various merges and count the time:

library(data.table)
library(dplyr)

# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)

dataT11 = data.table(list, random_number, key="list") # data.table with a key
dataT12 = data.table(list, random_number) # data.table without key
dataF1 = data.frame(list, random_number)

# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)

dataT21 = data.table(list, random_number, key="list")
dataT22 = data.table(list, random_number)
dataF2 = data.frame(list, random_number)


# check your current data tables (note some have keys)
tables()


# merge the datasets as data.frames and count time
ptm <- proc.time()
dataF3 = merge(dataF1, dataF2, all.x=T)
proc.time() - ptm


# merge the datasets as data.tables by setting the key now and count time
ptm <- proc.time()
dataT3 = merge(dataT12, dataT22, all.x=T, by = "list")
proc.time() - ptm


# merge the datasets as data.tables on the key they have already and count time
ptm <- proc.time()
dataT3 = merge(dataT11, dataT21, all.x=T)
proc.time() - ptm

# merge the datasets as data.tables on the key they have already and count time (alternative)
ptm <- proc.time()
dataT3 = dataT11[dataT21]
proc.time() - ptm



# merge the datasets as data.frames using dplyr and count time
ptm <- proc.time()
dataT3 = dataF1 %>% left_join(dataF2, by="list")
proc.time() - ptm


# merge the datasets as data.tables using dplyr and count time
ptm <- proc.time()
dataT3 = dataT11 %>% left_join(dataT21, by="list")
proc.time() - ptm


来源:https://stackoverflow.com/questions/31876335/merge-2-dataframes-by-matching-dates

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!