Matching based on different independent tables using data.table in R

|▌冷眼眸甩不掉的悲伤 提交于 2021-01-01 09:09:06

问题


I would like to match multiple conditions from independent data tables onto my main data table. How can I do this using the data.table package? What would be the most efficient/fastest way?

I have a mock example, with some mock conditions here to illustrate my question:

main_data <- data.frame( pnum = c(1,2,3,4,5,6,7,8,9,10),
                         age = c(24,35,43,34,55,24,36,43,34,54),
                         gender = c("f","m","f","f","m","f","m","f","f","m"))

data_1 <- data.frame( pnum = c(1,4,5,8,9),
                      value_data_1 = c(1, 2, 1, 1, 1),
                      date = as.Date(c("2019-01-01", "2018-07-01", "2018-01-01", "2016-07-01", "2016-07-01")))

data_2 <- data.frame( pnum = c(1,5,7,8,9),
                      value_data_2 = c(1, 2, 1, 1, 2),
                      date = as.Date(c("2019-01-01", "2018-07-01", "2018-01-01", "2016-07-01", "2016-07-01")))

I would like to create a new variable in my main_data table called "matching" of those rows that match between data_1 and data_2 under multiple conditions:

  • First, the value of data_1$value_data_1 has to be equal to 1.
  • Second, the value of data_2$value_data_2 also has to be equal to 1.
  • Third, the pnum and the date should match between data_1 and data_2.

When all these conditions are met, I would expect the new output of main_data to look like this:

> main_data
   pnum age gender matching
1     1  24      f        1
2     2  35      m        0
3     3  43      f        0
4     4  34      f        0
5     5  55      m        0
6     6  24      f        0
7     7  36      m        0
8     8  43      f        1
9     9  34      f        0
10   10  54      m        0

Until now, I programmed each condition seperately and created new placeholder tables in between, but this is not very memory efficient. Is there an efficient way to chain all the conditions using the data.tables package specifically?


回答1:


Here's one way:

library(data.table)
library(magrittr)
setDT(main_data)
setDT(data_1)
setDT(data_2)

main_data %>%
  data_1[., on = .(pnum == pnum) ] %>%
  data_2[., on = .(pnum == pnum, date == date) ] %>%
  .[, matching := fcoalesce(+(value_data_1 == 1 & value_data_2 == 1), 0L) ] %>%
  .[, .(pnum, age, gender, matching) ]
#     pnum age gender matching
#  1:    1  24      f        1
#  2:    2  35      m        0
#  3:    3  43      f        0
#  4:    4  34      f        0
#  5:    5  55      m        0
#  6:    6  24      f        0
#  7:    7  36      m        0
#  8:    8  43      f        1
#  9:    9  34      f        0
# 10:   10  54      m        0

I used the magrittr package because I find it useful for portraying the flow code. It is not at all required, and the alternative pipeline for data.table for the same code could be:

data_2[
  data_1[main_data, on = .(pnum == pnum) ]
 ,on = .(pnum == pnum, date == date)
][ ,matching := fcoalesce(+(value_data_1 == 1 & value_data_2 == 1), 0L)
  ][ ,.(pnum, age, gender, matching) ]

There are other ways to break it out, including the use of temporary (mid-step) variables. (This is mostly style and personal preference.)




回答2:


You can use something like Reduce(merge, list(...))

library(data.table)

setDT(main_data); setDT(data_1); setDT(data_2)

res <- Reduce(function(x, y) {
  merge(x, y, by = "pnum", all.x = TRUE)
}, list(main_data, data_1[, -"date"], data_2[, -"date"]))[, `:=`(
  matching = 1L - (value_data_1 != 1 | value_data_2 != 1 | is.na(value_data_1) | is.na(value_data_2)), 
  value_data_1 = NULL,
  value_data_2 = NULL
)]

Output

> res[]
    pnum age gender matching
 1:    1  24      f        1
 2:    2  35      m        0
 3:    3  43      f        0
 4:    4  34      f        0
 5:    5  55      m        0
 6:    6  24      f        0
 7:    7  36      m        0
 8:    8  43      f        1
 9:    9  34      f        0
10:   10  54      m        0


来源:https://stackoverflow.com/questions/65006467/matching-based-on-different-independent-tables-using-data-table-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!