Finding overlap in ranges with R

Posted by 孤街醉人 on 2019-11-27 11:50:59

This is a lot easier and faster if you merge the two objects first.

ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
#  chrom startA stopA startB stopB
#1     1    200   250    200   265
#2     5    100   105     99   106
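
One thing to keep in mind with this approach: merge() on chrom alone pairs every rangesA row with every rangesB row on the same chromosome, so the intermediate table can grow quickly before the filter is applied. A minimal sketch with made-up data (the data frames a and b and their values are hypothetical, not taken from the question):

```r
# Hypothetical data: two A ranges and two B ranges on the same chromosome
a <- data.frame(chrom = c(1, 1), start = c(10, 50), stop = c(20, 60))
b <- data.frame(chrom = c(1, 1), start = c(5, 55), stop = c(25, 70))

# merge() on chrom alone yields all 2 x 2 = 4 A/B pairs for chromosome 1
pairs <- merge(a, b, by = "chrom", suffixes = c("A", "B"))

# Keep only the pairs where the B range fully contains the A range
contained <- pairs[with(pairs, startB <= startA & stopB >= stopA), ]
```

Here only the pair (10-20 inside 5-25) survives the filter; the other three pairs are built and then discarded.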

Use the IRanges/GenomicRanges packages from Bioconductor, which are made for exactly these problems and scale to very large data sets.

# biocLite() has been retired; current Bioconductor releases install through BiocManager
install.packages("BiocManager")
BiocManager::install("IRanges")

There are a few appropriate containers for ranges on different chromosomes; one is IRangesList (called RangesList in older IRanges releases).

library(IRanges)
rangesA <- split(IRanges(rangesA$start, rangesA$stop), rangesA$chrom)
rangesB <- split(IRanges(rangesB$start, rangesB$stop), rangesB$chrom)
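# type="within" requires the query (first argument) to lie wholly inside a subject range
# which rangesA are wholly contained in at least one rangesB?
ov <- countOverlaps(rangesA, rangesB, type="within") > 0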
Arun

The data.table package has a function foverlaps() which is capable of merging over interval ranges since v1.9.4:

require(data.table)
setDT(rangesA)
setDT(rangesB)

setkey(rangesB)
foverlaps(rangesA, rangesB, type="within", nomatch=0L)
#    chrom start stop i.start i.stop
# 1:     5    99  106     100    105
# 2:     1   200  265     200    250
  • setDT() converts a data.frame to a data.table by reference.

  • setkey() sorts the data.table by the columns provided (here all columns, since none were given) and marks those columns as sorted; foverlaps() then uses that key to perform the join.

  • foverlaps() does the overlapping join efficiently. See this answer for a detailed explanation and comparison to other approaches.

Here is a dplyr solution.

library(dplyr)
inner_join(rangesA, rangesB, by="chrom") %>%
  # keep pairs where the rangesB interval (.y) fully contains the rangesA interval (.x)
  filter(start.y <= start.x & stop.y >= stop.x)

Output:

  chrom start.x stop.x start.y stop.y
1     5     100    105      99    106
2     1     200    250     200    265

For your example data:

rangesA <- data.frame(
    chrom = c(5, 1, 9),
    start = c(100, 200, 275),
    stop = c(105, 250, 300)
)
rangesB <- data.frame(
    chrom = c(1, 5, 9),
    start = c(200, 99, 275),
    stop = c(265, 106, 290)
)

This will do it with sapply, such that each column corresponds to one row of rangesA and each row corresponds to one row of rangesB:

> sapply(rangesA$stop, '>=', rangesB$start) & sapply(rangesA$start, '<=', rangesB$stop)
      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE FALSE
[3,] FALSE FALSE  TRUE
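
Note that the matrix above flags any overlap and does not look at chrom. A possible variant, sketched here with outer() under the assumption that a match should mean a rangesA interval lying entirely inside a rangesB interval on the same chromosome (the names same_chrom, contained and hits are illustrative only):

```r
rangesA <- data.frame(chrom = c(5, 1, 9), start = c(100, 200, 275), stop = c(105, 250, 300))
rangesB <- data.frame(chrom = c(1, 5, 9), start = c(200, 99, 275), stop = c(265, 106, 290))

# Rows index rangesB, columns index rangesA (same layout as the sapply result)
same_chrom <- outer(rangesB$chrom, rangesA$chrom, "==")
contained  <- outer(rangesB$start, rangesA$start, "<=") &
              outer(rangesB$stop,  rangesA$stop,  ">=")

# Positions of the TRUE cells: "row" is the rangesB row, "col" is the rangesA row
hits <- which(same_chrom & contained, arr.ind = TRUE)
```

For the example data this leaves two hits, the chrom 1 and chrom 5 pairs; the chrom 9 pair drops out because 275-300 extends past 275-290.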
mikyatope

rangesA and rangesB are clearly in BED format, so this can also be done outside R, on the command line, with BEDtools. It is extremely fast and flexible, with a dozen other options for working with genomic intervals. Then read the results back into R.

https://code.google.com/p/bedtools/
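
A minimal sketch of the round trip from R, assuming the start/stop columns are already BED-style coordinates as the answer suggests. The to_bed helper and the file names are hypothetical. The bedtools call is shown only as a comment because it runs in the shell; there, -f 1.0 asks that each A interval be fully covered by a B interval, and -wa -wb report both entries of each pair.

```r
rangesA <- data.frame(chrom = c(5, 1, 9), start = c(100, 200, 275), stop = c(105, 250, 300))

# Hypothetical helper: write a data frame as a headerless, tab-separated BED file
to_bed <- function(df, path) {
  write.table(df[, c("chrom", "start", "stop")], path, sep = "\t",
              quote = FALSE, row.names = FALSE, col.names = FALSE)
  invisible(path)
}

bed_a <- to_bed(rangesA, tempfile(fileext = ".bed"))

# In the shell (not run here; A.bed, B.bed and hits.bed are placeholder names):
#   bedtools intersect -a A.bed -b B.bed -f 1.0 -wa -wb > hits.bed

# Reading a BED-like file back into R
back <- read.table(bed_a, sep = "\t", col.names = c("chrom", "start", "stop"))
```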
