问题
I would like to filter a data frame based on the values present in a second data frame.
For example, match the rows from the first data frame that, in the column "BP", are higher than the first value of the "start_pos" column and smaller than "end_pos" column or just smaller than "end_pos" in the second data frame.
I need to repeat this procedure for all the values in the second data frame. Currently, I am performing these using a for loop. However, I would like to do it in a single command.
Data frame 1
CHR BP
29 836019
29 4417047
29 7589996
29 11052921
29 14009294
29 33174196
Data frame 2
start_pos end_pos gene_id
19774 19899 ENSBTAG00000046619
34627 35558 ENSBTAG00000006858
69695 71121 ENSBTAG00000039257
83323 84281 ENSBTAG00000035349
124849 179713 ENSBTAG00000001753
264298 264843 ENSBTAG00000005540
for(j in 1:nrow(tmp_markers)){
temp_out_markers<- tmp_markers[j,]
tmp_search<-tmp_gene[which((tmp_markers[j,"BP"]>=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]<= tmp_gene[,"end_pos"]) | (tmp_markers[j,"BP"]+interval>=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]+interval <=tmp_gene[,"end_pos"]) | (tmp_markers[j,"BP"]+interval>=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]+interval <=tmp_gene[,"end_pos"]) | (tmp_markers[j,"BP"]+interval>=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]+interval >=tmp_gene[,"end_pos"]& tmp_markers[j,"BP"]<=tmp_gene[,"start_pos"])| (tmp_markers[j,"BP"]-interval<=tmp_gene[,"end_pos"] & tmp_markers[j,"BP"]-interval >=tmp_gene[,"start_pos"])|(tmp_markers[j,"BP"]-interval<=tmp_gene[,"end_pos"] & tmp_markers[j,"BP"]-interval<=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]>=tmp_gene[,"end_pos"])),]
if(nrow(tmp_search)>0){
temp_out<-cbind(temp_out_markers[rep(seq_len(nrow(tmp_search))),],tmp_search)
temp_out[,"Distance_from_gene_start"]<-temp_out[,"BP"]-temp_out[,"start_pos"]
temp_out[,"Distance_from_gene_end"]<-temp_out[,"BP"]-temp_out[,"end_pos"]
output_genes<-rbind(temp_out,output_genes)
}
}
At the end, I want a data frame with all the rows that are within my tested intervals.
回答1:
As I stated in a comment, your mock data won't result in a match, as the smallest BP
value (836019) is larger than the largest end_pos
(264843).
It could be also that I misunderstood altogether your problem!
I understand that you want to match the rows in df1
to those in df2
for which BP >= start_pos
and BP <= end_pos
. If it's so, we can achieve that using the non-equi joins provided by package data.table
.
library(data.table)
result <- dt1[dt2,
.(BP, CHR, gene_id),
on = .(BP >= start_pos, BP <= end_pos),
nomatch = NULL,
by = .EACHI]
setnames(result, 1:2, names(dt2)[1:2])
result
start_pos end_pos BP CHR gene_id
1: 0.000000 2.000000 0 29 ABCD01
2: 4.571429 6.571429 6 30 ABCD03
3: 11.428571 13.428571 12 31 ABCD06
4: 16.000000 18.000000 18 32 ABCD08
5: 22.857143 24.857143 24 33 ABCD011
6: 29.714286 31.714286 30 34 ABCD014
In case you need the full 15 rows of dt2
, simply omit the nomatch = NULL
part.
DATA USED:
dt1 <- data.table(CHR = 29:34,
BP = seq(0, 30, length.out = 6),
key = "BP")
dt2 <- data.table(start_pos = seq(0, 32, length.out = 15),
gene_id = paste0("ABCD", rep(0, 3), 1:15))
dt2[, end_pos := start_pos + 2]
setcolorder(dt2, c(1, 3, 2))
Alternative with foverlaps
As @r2evans mentioned in a comment, data.table
has another function, foverlaps
than can be useful here. It checks if a range overlaps with one in another table, so we need to do a small trick to create a 0-width range in dt1
:
dt1[, BP2 := BP]
We also need to have keyed data.tables:
setkey(dt1, "BP", "BP2")
setkey(dt2, "start_pos", "end_pos")
And then pass it to the working horse:
foverlaps(dt1, dt2)
start_pos end_pos gene_id CHR BP BP2
1: 0.000000 2.000000 ABCD01 29 0 0
2: 4.571429 6.571429 ABCD03 30 6 6
3: 11.428571 13.428571 ABCD06 31 12 12
4: 16.000000 18.000000 ABCD08 32 18 18
5: 22.857143 24.857143 ABCD011 33 24 24
6: 29.714286 31.714286 ABCD014 34 30 30
Of course we can get rid of BP2
later on by BP2 := NULL
.
If we want the full 15 rows of dt2
, the it's just inverting the order of the objects in the call:
foverlaps(dt2, dt1)
回答2:
Thank you very much!
I ended with this solution and it is working very well.
foverlaps(tmp_gene, tmp_markers, by.x = c("start_pos","end_pos"), by.y =
key(tmp_markers),nomatch = 0)
Cheers.
来源:https://stackoverflow.com/questions/57876463/non-equi-joins-comparing-two-data-frames-in-r