Merge overlapping numeric ranges into continuous ranges

一世执手 提交于 2019-12-21 02:54:55

问题


I am trying to merge a range of genomic coordinates into continuous ranges, with an additional option for merging across gaps.

For example, if I had the genomic ranges [[0, 1000], [5, 1100]] I would want the result to be [0, 1100]. If the offset option was set to 100, and the input was [[0, 1000], [1090, 1000]] I would once again want the result to be [0, 1100].

I have implemented a way of doing this that steps through the alignments sequentially and tries to merge on the previous ending position and next starting position, but it fails because the actual results have varying lengths. For example I have the results [[138, 821],[177, 1158], [224, 905], [401, 1169]] in my list sorted by the start positions. The answer to that should be [138, 1169] but I instead get [[138, 1158], [177, 905], [224, 1169]]. Obviously I need to take more into account than just the previous ending and the next start, but I haven't found a good solution (preferably one that isn't a huge nest of if statements). Anyone have any suggestions?

def overlap_alignments(align, gene, overlap):
    #make sure alignments are sorted first by chromosome then by start pos on chrom
    align = sorted(align, key = lambda x: (x[0], x[1]))
    merged = list()
    for i in xrange(1, len(align)):
        prv, nxt = align[i-1], align[i]
        if prv[0] == nxt[0] and prv[2] + overlap >= nxt[1]:
            start, end = prv[1], nxt[2]
            chrom = prv[0]
            merged.append([chrom, start, end, gene])
    return merged

回答1:


Well, how about keeping track of every start and end and the number of ranges where each position belongs to?

def overlap_alignments(align, overlap):
    # create a list of starts and ends
    stends = [ (a[0], 1) for a in align ]
    stends += [ (a[1] + overlap, -1) for a in align ]
    stends.sort(key=lambda x: x[0])

    # now we should have a list of starts and ends ordered by position,
    # e.g. if the ranges are 5..10, 8..15, and 12..13, we have
    # (5,1), (8,1), (10,-1), (12,1), (13,-1), (15,-1)

    # next, we form a cumulative sum of this
    s = 0
    cs = []
    for se in stends:
        s += se[1]
        cs.append((se[0], s))
    # this is, with the numbers above, (5,1), (8,2), (10,1), (12,2), (13,1), (15,0)
    # so, 5..8 belongs to one range, 8..10 belongs to two overlapping range,
    # 10..12 belongs to one range, etc

    # now we'll find all contiguous ranges
    # when we traverse through the list of depths (number of overlapping ranges), a new
    # range starts when the earlier number of overlapping ranges has been 0
    # a range ends when the new number of overlapping ranges is zero 
    prevdepth = 0
    start = 0
    combined = []
    for pos, depth in cs:
        if prevdepth == 0:
            start = pos
        elif depth == 0
            combined.append((start, pos-overlap))
        prevdepth = depth

    return combined

This would be easier to draw than to explain. (And yes, the cumulative sum could be written in a shorter space, but i find it clearer this way.)

To explain this graphically, lets take input ([5,10],[8,15],[12,13],[16,20]) and overlap=1.

.....XXXXXo.............. (5-10)
........XXXXXXXo......... (8-15)
............Xo........... (12-13)
................XXXXo.... (16-20)
.....1112221221111111.... number of ranges at each position
.....----------------.... number of ranges > 0
.....---------------..... overlap corrected (5-20)



回答2:


Python comes batteries included:

from itertools import chain

flatten = chain.from_iterable

LEFT, RIGHT = 1, -1

def join_ranges(data, offset=0):
    data = sorted(flatten(((start, LEFT), (stop + offset, RIGHT))
            for start, stop in data))
    c = 0
    for value, label in data:
        if c == 0:
            x = value
        c += label
        if c == 0:
            yield x, value - offset

if __name__ == '__main__':
    print list(join_ranges([[138, 821], [900, 910], [905, 915]]))
    print list(join_ranges([[138, 821], [900, 910], [905, 915]], 80))

Result:

[(138, 821), (900, 915)]
[(138, 915)]

How it works: we label every start and end point as such, then we sort and afterwards we simply count up for every start point and down for every end point. If we have visited the same number of start and end points, we have a closed (joined) range.



来源:https://stackoverflow.com/questions/24317211/merge-overlapping-numeric-ranges-into-continuous-ranges

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!