Collapse intersecting regions

别那么骄傲 2020-12-03 02:24

I am trying to find a way to collapse rows with intersecting ranges, denoted by "start" and "stop" columns, and record the collapsed values into new columns. For example

2 answers
  •  感情败类
    2020-12-03 02:37

    IRanges is a good fit for this kind of job. There is no need to use the chrom variable.

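    library(IRanges)
    ir <- IRanges(my.df$start, my.df$stop)
    ## Create a new grouping variable. Note the use of reduce() here (for performance):
    ## each row is matched against the merged ranges rather than against every other row.
    my.df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))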
    # chrom name    start     stop group2
    # 1     1    a    70001    71200      2
    # 2     1    b    70203    80001      2
    # 3     1    c    70060    71051      2
    # 4    14    d    40004    42004      1
    # 5    16    e 50000872 50000890      3
    # 6    16    f 50000872 51000952      3
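
    For readers without IRanges at hand, the grouping idea can be sketched in base R. This is only an approximation under stated assumptions: overlap_group is a hypothetical helper (not part of IRanges), the data frame is reconstructed from the printed output above, and only intervals that share at least one position are treated as intersecting. As far as I know, reduce() also merges directly adjacent ranges by default, which this sketch does not replicate.

```r
# Data reconstructed from the printed output above.
my.df <- data.frame(
  chrom = c(1, 1, 1, 14, 16, 16),
  name  = c("a", "b", "c", "d", "e", "f"),
  start = c(70001, 70203, 70060, 40004, 50000872, 50000872),
  stop  = c(71200, 80001, 71051, 42004, 50000890, 51000952)
)

# Hypothetical helper (not an IRanges function): label each closed interval
# [start, stop] with the index of the merged region it falls into.
overlap_group <- function(start, stop) {
  ord <- order(start, stop)            # sweep the intervals from left to right
  s <- start[ord]
  e <- stop[ord]
  # Largest stop seen before each interval (-Inf for the first one).
  prev_max <- c(-Inf, head(cummax(e), -1))
  # A new region begins whenever an interval starts past everything seen so far.
  grp <- integer(length(start))
  grp[ord] <- cumsum(s > prev_max)     # write labels back in original row order
  grp
}

my.df$group2 <- overlap_group(my.df$start, my.df$stop)
my.df$group2
# 2 2 2 1 3 3
```

    The labels follow the sorted order of the merged regions, which is why row d (the smallest start) ends up in group 1.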
    

    The new group2 variable indicates which merged range each row belongs to. Now, using data.table, I can transform my data into the desired output:

    library(data.table)
    DT <- as.data.table(my.df)
    DT[, list(start = min(start), stop = max(stop),
              name = list(name), chrom = unique(chrom)),
       by = group2]
    
    # group2    start     stop  name chrom
    # 1:      2    70001    80001 a,b,c     1
    # 2:      1    40004    42004     d    14
    # 3:      3 50000872 51000952   e,f    16
    

    PS: the collapsed name variable here is not a string but a list of factors. This is more efficient and easier to access than a collapsed character string built with paste, for example.
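
    To make the PS concrete, here is a small base R sketch of the two representations. It is an illustration only: split() and vapply() stand in for the data.table grouping, and the vectors are copied from the output above.

```r
# Values copied from the group2 output above.
name   <- factor(c("a", "b", "c", "d", "e", "f"))
group2 <- c(2, 2, 2, 1, 3, 3)

# List form: one factor vector per group, elements stay separate.
as_list <- split(name, group2)

# String form: the same members glued together with paste().
as_text <- vapply(as_list, paste, character(1), collapse = ",")

# Members of group 2 straight from the list ...
members_list <- as.character(as_list[["2"]])
# ... versus parsing them back out of the pasted string.
members_text <- strsplit(as_text[["2"]], ",", fixed = TRUE)[[1]]
```

    With the list form, each member is directly available; the pasted string has to be split again, and that breaks if a name ever contains the separator.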

    EDIT: after the OP's clarification, I now create the group variable by chrom; that is, the IRanges code is called once for each chrom group. I slightly modified your data to create several groups of intervals on the same chromosome.

    my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), 
                       name=c("a","b","c","d","e","f","g"),
                       start=as.numeric(c(0,3000,70203,70060, 40004, 50000872, 50000872)), 
                       stop=as.numeric(c(1,5000,80001,71051, 42004, 50000890, 51000952)))
    
    library(IRanges)
    library(data.table)
    DT <- as.data.table(my.df)
    
    ## find the interval groups within each chromosome
    DT[, group := {
           ir <- IRanges(start, stop)
           subjectHits(findOverlaps(ir, reduce(ir)))
         }, by = chrom]
    
    ## Now aggregate by group and chrom (chrom is already a grouping column,
    ## so it does not need to be repeated inside list())
    DT[, list(start = min(start), stop = max(stop), name = list(name)),
       by = list(group, chrom)]
    
    #    group chrom    start     stop name
    # 1:     1     1        0        1    a
    # 2:     2     1     3000     5000    b
    # 3:     3     1    70060    80001  c,d
    # 4:     1    14    40004    42004    e
    # 5:     1    16 50000872 51000952  f,g
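
    To see why the final step groups by both group and chrom, here is a base R sketch on made-up rows. Everything in it is an assumption for illustration: overlap_group is a hypothetical stand-in for the IRanges step, and the toy data frame is not from the question.

```r
# Hypothetical base-R stand-in for the IRanges step: number the merged
# regions that each closed interval [start, stop] falls into.
overlap_group <- function(start, stop) {
  ord <- order(start, stop)
  prev_max <- c(-Inf, head(cummax(stop[ord]), -1))
  grp <- integer(length(start))
  grp[ord] <- cumsum(start[ord] > prev_max)
  grp
}

# Made-up rows: "x" and "y" overlap in coordinates but sit on different chromosomes.
toy <- data.frame(chrom = c(1, 2, 2),
                  name  = c("x", "y", "z"),
                  start = c(100, 150, 900),
                  stop  = c(200, 250, 950))

# Ignoring chrom merges x and y into a single region.
pooled <- overlap_group(toy$start, toy$stop)

# Restarting the numbering inside each chromosome keeps them apart,
# but the labels are then only unique together with chrom.
rows <- seq_len(nrow(toy))
per_chrom <- ave(rows, toy$chrom,
                 FUN = function(i) overlap_group(toy$start[i], toy$stop[i]))

n_pooled    <- length(unique(pooled))                           # 2 regions
n_per_chrom <- nrow(unique(data.frame(toy$chrom, per_chrom)))   # 3 regions
```

    Because the per-chromosome labels restart at 1, group alone does not identify a region, hence by = list(group, chrom) in the aggregation.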
    
