Find point-to-range overlaps

半世苍凉 submitted on 2019-12-01 22:46:41

Using GenomicRanges:

library(GenomicRanges)

# Convert to GRanges objects; each point becomes a range whose start equals its end
gr1 <- GRanges(seqnames = df1$V1,
               ranges = IRanges(df1$V2, df1$V2))

gr2 <- GRanges(seqnames = df2$V1,
               ranges = IRanges(df2$V2, df2$V3))
# Keep only the points in gr1 that fall inside a range in gr2
subsetByOverlaps(gr1, gr2)

# GRanges object with 3 ranges and 0 metadata columns:
#       seqnames             ranges strand
#          <Rle>          <IRanges>  <Rle>
#  [1]    Chr06 [  82862,   82862]      *
#  [2]    Chr06 [ 387314,  387314]      *
#  [3]    Chr06 [1018696, 1018696]      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Or use mergeByOverlaps to return each point alongside the range it falls in
mergeByOverlaps(gr1, gr2)

# DataFrame with 3 rows and 2 columns
#                          gr1                        gr2
#                    <GRanges>                  <GRanges>
# 1 Chr06:*:[  82862,   82862] Chr06:*:[  79720,   87043]
# 2 Chr06:*:[ 387314,  387314] Chr06:*:[ 387314,  387371]
# 3 Chr06:*:[1018696, 1018696] Chr06:*:[1018676, 1018736]
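
If a TRUE/FALSE flag per point is more convenient than a subset, overlapsAny() and findOverlaps() from the same package can provide it. This is a minimal sketch, assuming small made-up data frames in the V1/V2/V3 layout used above; the non-matching coordinates are hypothetical and only there to show a FALSE result.

```r
library(GenomicRanges)

# Hypothetical toy data in the same layout as df1/df2 above
df1 <- data.frame(V1 = "Chr06", V2 = c(100, 82862, 387314, 500000))
df2 <- data.frame(V1 = "Chr06",
                  V2 = c(79720, 387314),
                  V3 = c(87043, 387371))

gr1 <- GRanges(seqnames = df1$V1, ranges = IRanges(df1$V2, df1$V2))
gr2 <- GRanges(seqnames = df2$V1, ranges = IRanges(df2$V2, df2$V3))

# One logical per point: does it fall inside any range on the same sequence?
hit <- overlapsAny(gr1, gr2)

# Index pairs (row of the point, row of the range) for every overlap
hits <- findOverlaps(gr1, gr2)
pairs <- data.frame(point = queryHits(hits), range = subjectHits(hits))
```

Here hit can be attached to the original table with df1$hit <- hit, and pairs can be used to look up the matching rows of df2.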

Also, look into bedtools:

Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.

Here is a data.table solution as an alternative to GenomicRanges:

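library(data.table)
# Turn each point into a single-position interval [V2, V2]
dt1 <- data.table(df1)[, V3 := V2]
# Key the range table on its start/end columns, which foverlaps() uses as the interval
dt2 <- data.table(df2, key = c("V2", "V3"))
# Join on position, keep same-chromosome matches, and select columns by name
foverlaps(dt1, dt2)[V1 == i.V1, .(V1, V2, V3, i.V3)]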
#       V1      V2      V3    i.V3
# 1: Chr06   79720   87043   82862
# 2: Chr06  387314  387371  387314
# 3: Chr06 1018676 1018736 1018696

You can do this using sapply:

sapply(seq_len(nrow(df1)), function(x) any(df1[x, 2] >= df2$V2 &
                                           df1[x, 2] <= df2$V3 &
                                           df1[x, 1] == df2$V1))
# [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE
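
To try this check without the original files, here is a minimal self-contained sketch in base R. The toy data frames are hypothetical (only the matching coordinates are borrowed from the output above), and the helper name point_in_ranges is invented for illustration.

```r
# Hypothetical toy data in the same V1/V2/V3 layout
df1 <- data.frame(V1 = c("Chr06", "Chr06", "Chr06", "Chr07"),
                  V2 = c(100, 82862, 387314, 82862),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("Chr06", "Chr06"),
                  V2 = c(79720, 387314),
                  V3 = c(87043, 387371),
                  stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when a point lies inside any range
# on the same chromosome (range ends are treated as inclusive)
point_in_ranges <- function(points, ranges) {
  vapply(seq_len(nrow(points)), function(i) {
    any(points$V1[i] == ranges$V1 &
        points$V2[i] >= ranges$V2 &
        points$V2[i] <= ranges$V3)
  }, logical(1))
}

hit <- point_in_ranges(df1, df2)

# Rows of df1 that fall inside a range
matched <- df1[hit, ]
```

Every point is compared against every range, so the work grows with nrow(df1) * nrow(df2); for large tables the interval-based GenomicRanges or data.table approaches are likely to scale better.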