Question
I have two sets of data.
Sample of set_A (total number of rows: 45467):
ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000
Sample of set_B (total number of rows: 4798):
ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889
I am interested in a result where rows from set_A are matched to the rows of set_B by the nearest values of time_a and time_b (total number of output rows: 4798). In set_A, values of time_a may be repeated several times (e.g. ID_a[8,] and ID_a[9,]); it doesn't really matter which of those rows is merged with a row from set_B (in this case ID_b[1,]). Example of the expected result:
ID_b b1 b2 source time_b ID_a a1 a2 a3 time_a
2 34.20 15.114 set1.csv.1 20.35750 8 85640 5274.1 301.6041 20.01000
7 67.20 16.114 set1.csv.2 21.35778 7 85697 5345.2 301.6043 21.00972
12 12.20 33.114 set1.csv.3 22.35806 4 65694 9375.2 301.6049 22.00972
17 73.20 67.114 set2.csv.1 23.35833 3 85694 9278.9 301.6051 23.00972
23 88.20 42.114 set2.csv.2 19.35861 5 85653 4375.5 301.6047 19.00972
28 90.20 52.114 set3.csv.1 00.35889 2 35694 5245.2 301.6053 00.00944
I went through many similar questions on Stack Overflow, and I really like the data.table library code because it looks really elegant. However, I made several failed attempts in which I received either a table built from both sets (total number of rows: 45467) or a result where only the time_a column was merged into set_B. Nevertheless, I won't be picky, and if somebody has another idea I would be very grateful for help.
Example of the code I'm working on:
setDT(set_B)
setDT(set_A)
setkey(set_B, time_b) [, time_a:=time_b]
test_ab <- set_B[set_A, roll='nearest']
As a result, I receive not only a table containing data that should be discarded but also "a mess" in the column names (e.g. the column that contains ID_a values is called time_a).
I would really appreciate your help!
Answer 1:
Here is a step-by-step example based on the sample data you give:
# Sample data: set_A and set_B as defined under "Sample data" at the end of this answer
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)
# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]
unique(set_merged[order(ID_b)], by = "time")
# ID_b b1 b2 source time_b time ID_a a1 a2 a3
# 1: 2 34.2 15.114 set1.csv.1 20.35750 20.01000 8 85640 5274.1 301.6041
# 2: 7 67.2 16.114 set1.csv.2 21.35778 21.00972 7 85697 5345.2 301.6043
# 3: 12 12.2 33.114 set1.csv.3 22.35806 22.00972 4 65694 9375.2 301.6049
# 4: 17 73.2 67.114 set2.csv.1 23.35833 23.00972 3 85694 9278.9 301.6051
# 5: 23 88.2 42.114 set2.csv.2 19.35861 19.00972 5 85653 4375.5 301.6047
# 6: 28 90.2 52.114 set3.csv.1 0.35889 0.00944 2 35694 5245.2 301.6053
# time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6: 0.00944
Two comments:
- We create a new time column to avoid losing one of the original time columns from set_A and set_B. You can always remove the time column after the join if required.
- We use unique to remove duplicated time rows by order of ID_b. You mention in your post that "it doesn't really matter which row will be merged" but in case that you do want to retain specific rows, you might need to adjust this line of code.
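If you stay with this first approach and do care which row is kept, one possible adjustment is to keep, for each row of set_B, the set_A row with the smallest time difference. The following is only a sketch on the sample data; the name best is introduced here for illustration and is not part of the original code. Note that in this join direction a set_B row that is never the nearest neighbour of any set_A row would not appear at all, so on real data the result may have fewer rows than set_B.

```r
library(data.table)

# Sample data from the question, rebuilt so the sketch is self-contained
set_A <- as.data.table(read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = TRUE))
set_B <- as.data.table(read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = TRUE))

# Helper column used as the join key, as in the answer above
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)

# One output row per set_A row, each carrying its nearest set_B row
set_merged <- set_B[set_A, roll = "nearest"]

# Per set_B row, keep the set_A candidate closest in time
# (which.min returns the first candidate when several are equally close)
best <- set_merged[, .SD[which.min(abs(time_a - time_b))], by = ID_b][order(ID_b)]
best[, time := NULL]  # drop the helper column
print(best)
```

Replacing which.min with another selection rule (for example, the largest a1 within each group) is where a different tie-breaking preference would go.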
Update (thanks to @Henrik)
As @Henrik pointed out, what you're after is actually a rolling join of set_A with respect to set_B, in which case you don't need to deal with the duplicate rows.
That translates to
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
# ID_a a1 a2 a3 time_a time ID_b b1 b2 source
#1: 2 35694 5245.2 301.6053 0.00944 0.35889 28 90.2 52.114 set3.csv.1
#2: 3 85694 9278.9 301.6051 23.00972 23.35833 17 73.2 67.114 set2.csv.1
#3: 5 85653 4375.5 301.6047 19.00972 19.35861 23 88.2 42.114 set2.csv.2
#4: 6 12694 5236.3 301.6045 22.00972 22.35806 12 12.2 33.114 set1.csv.3
#5: 7 85697 5345.2 301.6043 21.00972 21.35778 7 67.2 16.114 set1.csv.2
#6: 9 30694 5279.0 301.6039 20.01000 20.35750 2 34.2 15.114 set1.csv.1
# time_b
#1: 0.35889
#2: 23.35833
#3: 19.35861
#4: 22.35806
#5: 21.35778
#6: 20.35750
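To get from this join to the column layout requested in the question (set_B columns first, one row per set_B row, no helper column), a small cleanup step may be enough. The sketch below is self-contained on the sample data; the name result is introduced here for illustration. It relies on X[Y] joins returning one row per row of Y, in Y's order, which is why no ordering or deduplication step is included.

```r
library(data.table)

# Sample data from the question, rebuilt so the sketch is self-contained
set_A <- as.data.table(read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = TRUE))
set_B <- as.data.table(read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = TRUE))

# Helper column used as the join key, so time_a and time_b both survive
set_A[, time := time_a]
set_B[, time := time_b]

# For every row of set_B, pick the set_A row with the nearest time
result <- set_A[set_B, on = "time", roll = "nearest"]

# Drop the helper column and move the set_B columns to the front
result[, time := NULL]
setcolorder(result, c("ID_b", "b1", "b2", "source", "time_b"))
print(result)
```

Where several set_A rows share the same time_a, only one of them is returned per set_B row, which is consistent with the question's statement that the choice among such rows does not matter.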
Sample data
set_A <- read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = TRUE)
set_B <- read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = TRUE)
Source: https://stackoverflow.com/questions/54013468/merging-two-sets-of-data-by-data-table-roll-nearest-function