How to merge two pandas dataframes (or transfer values) by comparing ranges of values

Question

In the following data:

data01 =

contig  start   end     haplotype_block
2       5207    5867    1856
2       155667  155670  2816
2       67910   68022   2
2       68464   68483   3
2       525     775     132
2       118938  119559  1157

data02 =

contig  start  last  feature  gene_id            gene_name  transcript_id
2       5262   5496  exon     scaffold_200003.1  CP5        scaffold_200003.1
2       5579   5750  exon     scaffold_200003.1  CP5        scaffold_200003.1
2       5856   6032  exon     scaffold_200003.1  CP5        scaffold_200003.1
2       6115   6198  exon     scaffold_200003.1  CP5        scaffold_200003.1
2       916    1201  exon     scaffold_200001.1  NA         scaffold_200001.1
2       614    789   exon     scaffold_200001.1  NA         scaffold_200001.1
2       171    435   exon     scaffold_200001.1  NA         scaffold_200001.1
2       2677   2806  exon     scaffold_200002.1  NA         scaffold_200002.1
2       2899   3125  exon     scaffold_200002.1  NA         scaffold_200002.1

Problem:

I want to compare the ranges (start to end in data01, start to last in data02) from these two dataframes.
If the ranges overlap, I want to transfer the gene_id and gene_name values from data02 to new columns in data01.

I tried (using pandas):

data01['gene_id'] = ""
data01['gene_name'] = ""

data01['gene_id'] = data01['gene_id'].\
apply(lambda x: data02['gene_id']\
        if range(data01['start'], data01['end'])\
           <= range(data02['start'], data02['last']) else 'NA')

How can I improve this code? I am currently sticking to pandas, but if the problem is better addressed with a dictionary, I am open to that. Please explain the process, though: I would rather learn than just get an answer.

Thanks,

Desired output:

contig  start    end    haplotype_block    gene_id    gene_name
2   5207    5867    1856    scaffold_200003.1,scaffold_200003.1,scaffold_200003.1   CP5,CP5,CP5

# gene_id and gene_name are repeated three times because three intervals from data02 (5262-5496, 5579-5750, 5856-6032) overlap (or touch) the data01 interval (5207-5867).

# So, whenever ranges from the two dataframes overlap, copy the gene_id and gene_name.

# Non-overlapping ranges simply get NA for gene_id and gene_name.

2   155667    155670    2816    NA    NA
2   67910    68022  2    NA    NA
2   68464    68483  3    NA    NA
2   525    775  132    scaffold_200001.1   NA
2   118938    119559    1157    NA    NA

Answer 1:

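import numpy as np
import pandas as pd

# Interval bounds as NumPy arrays.
# Note: contig is not compared here; all sample rows are on contig 2.
s1 = data01['start'].values
e1 = data01['end'].values
s2 = data02['start'].values
e2 = data02['last'].values

# overlap[i, j] is True when data01 row i and data02 row j share at least
# one position. This form also catches a data02 interval that fully
# contains the data01 interval, which an endpoint-inside-range test misses.
overlap = (s1[:, None] <= e2) & (e1[:, None] >= s2)

g = data02['gene_id'].values
n = data02['gene_name'].values

# i: positions of data01 rows, j: positions of the overlapping data02 rows
i, j = np.where(overlap)
idx_map = {i_: data01.index[i_] for i_ in pd.unique(i)}

def make_series(m):
    # Join the values of all overlapping data02 rows for each data01 row
    s = pd.Series(m[j]).fillna('').groupby(i).agg(','.join)
    # Relabel positions with data01's index labels; '' becomes NaN
    return s.rename(index=idx_map).replace('', np.nan)

# Rows of data01 without any overlap get NaN in both new columns
data01.assign(
    gene_id=make_series(g),
    gene_name=make_series(n),
)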
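
Since the question asks for an explanation of the process, here is a minimal sketch of the broadcasting step on a subset of the sample rows. It assumes closed intervals and uses the condition "each interval starts no later than the other ends", which also covers one interval fully containing the other. The names rows, cols and pairs are only illustrative.

```python
import numpy as np
import pandas as pd

# Subset of the sample rows from the question.
data01 = pd.DataFrame({
    "start": [5207, 155667, 525],
    "end": [5867, 155670, 775],
})
data02 = pd.DataFrame({
    "start": [5262, 5579, 5856, 6115, 614],
    "last": [5496, 5750, 6032, 6198, 789],
})

s1 = data01["start"].to_numpy()[:, None]  # shape (3, 1)
e1 = data01["end"].to_numpy()[:, None]    # shape (3, 1)
s2 = data02["start"].to_numpy()           # shape (5,)
e2 = data02["last"].to_numpy()            # shape (5,)

# Comparing a (3, 1) array with a (5,) array broadcasts to a (3, 5) matrix:
# overlap[r, c] is True when data01 row r and data02 row c overlap.
overlap = (s1 <= e2) & (e1 >= s2)

# Positions of the True cells, as (data01 row, data02 row) pairs.
rows, cols = np.nonzero(overlap)
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)
```

Each pair says which data02 row contributes to which data01 row; grouping the data02 values by the first element of the pair and joining them with commas gives the desired columns.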

Answer 2:

I realize you are using Python, but your problem may be easily addressed with the classic bioinformatics tool bedtools intersect: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

Both of your input files follow standard BED format: http://bedtools.readthedocs.io/en/latest/content/general-usage.html

Bedtools intersect gives you advanced logic for how to determine what constitutes an intersection or overlap between two regions. I believe it can also operate directly on bgzipped input.

Answer 3:

You could use an interval tree in Python; interval trees are very efficient and memory friendly. I tried something similar and ran into an issue that was later solved. The code I wrote is in "Using Interval tree to find overlapping regions".

You can build on that code.
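
To illustrate the idea without any extra dependency, here is a minimal sketch of a simplified centered interval tree in plain Python. It is not the code referred to above; the class IntervalNode and its method overlapping are hypothetical names, intervals are assumed to be closed, and the intervals stored at each node are scanned linearly, which a production implementation would avoid.

```python
class IntervalNode:
    """Node of a simplified centered interval tree over closed intervals."""

    def __init__(self, intervals):
        # intervals: non-empty list of (start, end, payload) tuples.
        endpoints = sorted(p for s, e, _ in intervals for p in (s, e))
        self.center = endpoints[len(endpoints) // 2]
        left, right, self.here = [], [], []
        for iv in intervals:
            if iv[1] < self.center:      # entirely left of the center
                left.append(iv)
            elif iv[0] > self.center:    # entirely right of the center
                right.append(iv)
            else:                        # spans the center point
                self.here.append(iv)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def overlapping(self, qs, qe):
        """Return payloads of stored intervals that overlap [qs, qe]."""
        hits = [p for s, e, p in self.here if s <= qe and e >= qs]
        # Only descend into a subtree that can contain an overlap.
        if self.left is not None and qs < self.center:
            hits += self.left.overlapping(qs, qe)
        if self.right is not None and qe > self.center:
            hits += self.right.overlapping(qs, qe)
        return hits


# data02 rows from the question as (start, last, gene_id).
exons = [
    (5262, 5496, "scaffold_200003.1"),
    (5579, 5750, "scaffold_200003.1"),
    (5856, 6032, "scaffold_200003.1"),
    (6115, 6198, "scaffold_200003.1"),
    (916, 1201, "scaffold_200001.1"),
    (614, 789, "scaffold_200001.1"),
    (171, 435, "scaffold_200001.1"),
    (2677, 2806, "scaffold_200002.1"),
    (2899, 3125, "scaffold_200002.1"),
]
tree = IntervalNode(exons)

# Three of the data01 intervals from the question as (start, end).
blocks = [(5207, 5867), (155667, 155670), (525, 775)]
result = {}
for start, end in blocks:
    genes = tree.overlapping(start, end)
    result[(start, end)] = ",".join(genes) if genes else "NA"
print(result)
```

The tree is built once from data02 and then queried once per data01 row, so the work per query depends mostly on the number of hits rather than on the size of data02.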

Source: https://stackoverflow.com/questions/43475370/how-to-merge-two-pandas-dataframes-or-transfer-values-by-comparing-ranges-of-v

Tags

python

pandas

dataframe

merge

bioinformatics