问题
I have a dataset (mydata
) that contains multiple columns which could fit inside ranges that are stored in another dataset (mycomparison
).
I would like to join mycomparison
to mydata
where the mydata
values are within the ranges in mycomparison
.
MWE
library(data.table)
mydata<-data.table(
id=1:5,
val1=seq(10000, 50000, by=10000),
val2=floor(rnorm(5,mean=400,sd=100)),
val3=rnorm(5,mean=.7,sd=.1)
)
mycomparison<-data.table(
Name=LETTERS[1:3],
minval1=c(0,30000,10000),
maxval1=c(50000,80000,30000),
minval2=c(300,400,300),
maxval2=c(800,800,800),
minval3=c(0,.5,.2),
maxval3=c(1,.9,.8),
correspondingval=c(.1,.2,.3)
)
Desired Output
> mydata.withmatches
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.4844319 A 0 50000 300 800 0 1 0.1
2: 2 20000 425 0.7856313 NA NA NA NA NA NA NA NA
3: 3 30000 324 0.8063969 NA NA NA NA NA NA NA NA
4: 4 40000 263 0.5590113 NA NA NA NA NA NA NA NA
5: 5 50000 187 0.8764396 NA NA NA NA NA NA NA NA
Current solution
This feels/is very clunky and involves cross-joining the data (using optiRum::CJ.dt
), doing a big logical check, and then reassembling the data.
library(optiRum)
workingdt<-CJ.dt(mydata,mycomparison)
matched<-workingdt[val1>=minval1 &
val1<=maxval1 &
val2>=minval2 &
val2<=maxval2 &
val3>=minval3 &
val3<=maxval3][which.min(correspondingval)]
notmatched<-mydata[id!= matched[,id]]
all<-list(matched,notmatched)
mydata.withmatches<- rbindlist(all, fill=TRUE, use.names=TRUE)
Looking for a better solution - UPDATED
I'm aware of foverlaps
but it will work on a single interval, not on many ranges like in this instance.
I'm hoping for a less clunky and more elegant solution.
回答1:
Nice question! Below you can find my quick fix, but it's also still a bit too cluncky for my taste.
The generated mydata
set:
id val1 val2 val3
1: 1 10000 377 0.7912443
2: 2 20000 378 0.7709792
3: 3 30000 484 0.7049517
4: 4 40000 513 0.5169590
5: 5 50000 474 0.7987448
A simple function to filter away the non-matching rows of the comparison dataset (multiple rows can match).
library(dplyr)
find_interval_func<-function(var.min.name, var.max.name, value, val.to.return){
compset<-data.frame(mycomparison)
for(i in 1:length(var.min.name)){
compset<-
compset %>%
filter_(paste0(var.min.name[[i]], "<=", value[[i]]),
paste0(var.max.name[[i]], ">", value[[i]]))
}
paste(compset[,val.to.return], collapse="|")
}
And the result:
> mydata %>%
+ group_by(1:n()) %>%
+ mutate(matchedValue = find_interval_func(c("minval1", "minval2", "minval3"),
+ c("maxval1", "maxval2", "maxval3"),
+ c(val1, val2, val3),
+ "Name"))
Source: local data table [5 x 6]
id val1 val2 val3 1:n() matchedValue
1 1 10000 377 0.7912443 1 A|C
2 2 20000 378 0.7709792 2 A|C
3 3 30000 484 0.7049517 3 A|B
4 4 40000 513 0.5169590 4 A|B
5 5 50000 474 0.7987448 5 B
回答2:
I do not exactly understand your Desired Output, because multiple id's match the mycomparison
data.table. Using your data (rounded to two decimal places):
> mydata
id val1 val2 val3
1: 1 10000 387 0.48
2: 2 20000 425 0.79
3: 3 30000 324 0.81
4: 4 40000 263 0.56
5: 5 50000 187 0.88
And
> mycomparison
Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: A 0 50000 300 800 0.0 1.0 0.1
2: B 30000 80000 400 800 0.5 0.9 0.2
3: C 10000 30000 300 800 0.2 0.8 0.3
This gives:
> workingdt
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0.0 1.0 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0.0 1.0 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
4: 4 40000 263 0.56 A 0 50000 300 800 0.0 1.0 0.1
5: 5 50000 187 0.88 A 0 50000 300 800 0.0 1.0 0.1
6: 1 10000 387 0.48 B 30000 80000 400 800 0.5 0.9 0.2
7: 2 20000 425 0.79 B 30000 80000 400 800 0.5 0.9 0.2
8: 3 30000 324 0.81 B 30000 80000 400 800 0.5 0.9 0.2
9: 4 40000 263 0.56 B 30000 80000 400 800 0.5 0.9 0.2
10: 5 50000 187 0.88 B 30000 80000 400 800 0.5 0.9 0.2
11: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
12: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
13: 3 30000 324 0.81 C 10000 30000 300 800 0.2 0.8 0.3
14: 4 40000 263 0.56 C 10000 30000 300 800 0.2 0.8 0.3
15: 5 50000 187 0.88 C 10000 30000 300 800 0.2 0.8 0.3
And leaving off your which.min()
:
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0.0 1.0 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0.0 1.0 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
4: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
5: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
If you use the data.table group-by functionality, you can pick the min(correspondingval)
for each id
(I am leaving off the unmatched data for the moment):
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
[,.SD[which.min(correspondingval)], by=id]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0 1 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0 1 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0 1 0.1
Or, the max(correspondingval)
if you prefer:
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
[,.SD[which.max(correspondingval)], by=id]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
2: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
If all you want--as shown in your Desired Output--is the first row with the minimum correspondingval
and everything else with NA
there are easier ways to do this. If you want to know where each id
matches a range--as I have shown in my output--then a cleaner, more elegant solution is different.
Let me know.
回答3:
There is a simple approach which uses the crossing function from tidyr, along with the very useful between function from dplyr. Of course, this works safely so long as at least one of the tables is relatively small, otherwise the crossing can be a memory hog.
library(tidyr)
library(dplyr)
mydata.withmatches <- mydata %>%
crossing(mycomparison) %>%
filter(between(val1, minval1,maxval1) & between(val2, minval2, maxval2) & between(val3, minval3, maxval3)) %>%
full_join(mydata) %>%
arrange(id, Name)
来源:https://stackoverflow.com/questions/28540466/how-to-identify-overlaps-in-multiple-columns