Question
In the following data:
data01 =
contig start end haplotype_block
2 5207 5867 1856
2 155667 155670 2816
2 67910 68022 2
2 68464 68483 3
2 525 775 132
2 118938 119559 1157
data02 =
contig start last feature gene_id gene_name transcript_id
2 5262 5496 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5579 5750 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5856 6032 exon scaffold_200003.1 CP5 scaffold_200003.1
2 6115 6198 exon scaffold_200003.1 CP5 scaffold_200003.1
2 916 1201 exon scaffold_200001.1 NA scaffold_200001.1
2 614 789 exon scaffold_200001.1 NA scaffold_200001.1
2 171 435 exon scaffold_200001.1 NA scaffold_200001.1
2 2677 2806 exon scaffold_200002.1 NA scaffold_200002.1
2 2899 3125 exon scaffold_200002.1 NA scaffold_200002.1
Problem:
- I want to compare the ranges (start - end) from these two data frames.
- If the ranges overlap, I want to transfer the gene_id and gene_name values from data02 to new columns in data01.
I tried (using pandas):
data01['gene_id'] = ""
data01['gene_name'] = ""
data01['gene_id'] = data01['gene_id'].\
apply(lambda x: data02['gene_id']\
if range(data01['start'], data01['end'])\
<= range(data02['start'], data02['last']) else 'NA')
How can I improve this code? I am currently sticking to pandas, but if the problem is better addressed with a dictionary, I am open to that. Please explain the process, though: I would rather learn than just get an answer.
Thanks,
Desired output:
contig start end haplotype_block gene_id gene_name
2 5207 5867 1856 scaffold_200003.1,scaffold_200003.1,scaffold_200003.1 CP5,CP5,CP5
# gene_id and gene_name are repeated three times because three intervals from data02 (i.e. 5262-5496, 5579-5750, 5856-6032) overlap (or touch) the interval from data01 (5207-5867).
# So, whenever ranges from the two data frames overlap, copy the gene_id and gene_name.
# For non-overlapping ranges, simply put NA in gene_id and gene_name.
2 155667 155670 2816 NA NA
2 67910 68022 2 NA NA
2 68464 68483 3 NA NA
2 525 775 132 scaffold_200001.1 NA
2 118938 119559 1157 NA NA
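Answer 1:
import numpy as np
import pandas as pd

# Interval endpoints as NumPy arrays
s1 = data01['start'].values
e1 = data01['end'].values
s2 = data02['start'].values
e2 = data02['last'].values

# Broadcast to a (len(data01), len(data02)) boolean matrix.
# Two closed intervals overlap when each one starts before the other ends.
# This also catches a data02 interval that fully contains a data01 interval.
overlap = (s1[:, None] <= e2) & (e1[:, None] >= s2)

g = data02['gene_id'].values
n = data02['gene_name'].values

# i holds row positions in data01, j the overlapping row positions in data02
i, j = np.where(overlap)
idx_map = {i_: data01.index[i_] for i_ in pd.unique(i)}

def make_series(m):
    # Join the values of every data02 row that overlaps a given data01 row
    s = pd.Series(m[j]).fillna('').groupby(i).agg(','.join)
    # Map row positions back to data01's index labels; empty strings become NaN
    return s.rename(index=idx_map).replace('', np.nan)

# Rows of data01 without any overlap are left as NaN
data01 = data01.assign(
    gene_id=make_series(g),
    gene_name=make_series(n),
)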
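To make the broadcasting idea above easier to follow, here is a minimal self-contained sketch on the question's sample data. It is a simplified variant, not the exact code above: the `collect` helper is a hypothetical name, the `contig` equality check is an added assumption (the sample only contains contig 2), and `keep_default_na=False` is used so that the literal text "NA" stays a string.

```python
import io

import pandas as pd

data01 = pd.read_csv(io.StringIO("""contig start end haplotype_block
2 5207 5867 1856
2 155667 155670 2816
2 67910 68022 2
2 68464 68483 3
2 525 775 132
2 118938 119559 1157
"""), sep=" ")

# keep_default_na=False keeps the text "NA" as a plain string
data02 = pd.read_csv(io.StringIO("""contig start last feature gene_id gene_name transcript_id
2 5262 5496 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5579 5750 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5856 6032 exon scaffold_200003.1 CP5 scaffold_200003.1
2 6115 6198 exon scaffold_200003.1 CP5 scaffold_200003.1
2 916 1201 exon scaffold_200001.1 NA scaffold_200001.1
2 614 789 exon scaffold_200001.1 NA scaffold_200001.1
2 171 435 exon scaffold_200001.1 NA scaffold_200001.1
2 2677 2806 exon scaffold_200002.1 NA scaffold_200002.1
2 2899 3125 exon scaffold_200002.1 NA scaffold_200002.1
"""), sep=" ", keep_default_na=False)

# Column vectors for data01 (one row per block), row vectors for data02
c1 = data01["contig"].to_numpy()[:, None]
s1 = data01["start"].to_numpy()[:, None]
e1 = data01["end"].to_numpy()[:, None]
c2 = data02["contig"].to_numpy()
s2 = data02["start"].to_numpy()
e2 = data02["last"].to_numpy()

# overlap[r, c] is True when block r and exon c share a contig and
# their closed intervals intersect (each starts before the other ends)
overlap = (c1 == c2) & (s1 <= e2) & (e1 >= s2)

def collect(column):
    """Comma-join data02[column] over the overlapping exons of each block."""
    values = data02[column].to_numpy()
    return [",".join(values[row]) if row.any() else "NA" for row in overlap]

result = data01.assign(gene_id=collect("gene_id"), gene_name=collect("gene_name"))
print(result)
```

The boolean matrix has one cell per pair of rows, so memory grows with len(data01) * len(data02); for very large tables a sorted or tree-based approach is likely a better fit.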
Answer 2:
I realize you are using Python, but your problem may be easily addressed using the classic bioinformatics tool bedtools intersect: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
Both of your input files follow standard BED format: http://bedtools.readthedocs.io/en/latest/content/general-usage.html
Bedtools intersect gives you advanced logic for how to determine what constitutes an intersection or overlap between two regions. I believe it can also operate directly on bgzipped input.
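As a rough sketch of how this could be driven from Python, the snippet below writes two small data frames as tab-separated files without a header row and, only if a `bedtools` executable is found on the PATH, runs `bedtools intersect` with `-loj` (left outer join, so blocks without an overlap are still reported). The file names, the temporary directory and the two-row sample frames are assumptions made for illustration. Coordinates are written unchanged; BED starts are 0-based with an exclusive end, so whether the start column needs shifting depends on how your data were produced.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-ins for the question's data01 and data02
data01 = pd.DataFrame({
    "contig": [2, 2],
    "start": [5207, 525],
    "end": [5867, 775],
    "haplotype_block": [1856, 132],
})
data02 = pd.DataFrame({
    "contig": [2, 2],
    "start": [5262, 614],
    "last": [5496, 789],
    "gene_id": ["scaffold_200003.1", "scaffold_200001.1"],
    "gene_name": ["CP5", "NA"],
})

workdir = Path(tempfile.mkdtemp())
a_path = workdir / "data01.bed"  # hypothetical file names
b_path = workdir / "data02.bed"

# Tab-separated, no header row, no index column
data01.to_csv(a_path, sep="\t", header=False, index=False)
data02.to_csv(b_path, sep="\t", header=False, index=False)

cmd = ["bedtools", "intersect", "-a", str(a_path), "-b", str(b_path), "-loj"]
if shutil.which("bedtools"):
    # One output line per (block, overlapping exon) pair
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    print(out)
else:
    print("bedtools not found; would run:", " ".join(cmd))
```

The output can be read back with pandas and grouped by the data01 columns to join the gene_id and gene_name values per block.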
Answer 3:
You should use an interval tree in Python; interval trees are very efficient and memory friendly. I tried something similar and ran into an issue, which was later solved. Here is the code I wrote: "Using Interval tree to find overlapping regions".
You can build on this code.
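For readers who want to see the idea without following the link, here is a minimal hand-rolled sketch of a centered interval tree in plain Python. It is not the code the answer refers to; the `IntervalNode` class and the `annotate` helper are hypothetical names, intervals are treated as closed (touching counts as overlap), and contigs are ignored because the sample data only has one.

```python
class IntervalNode:
    """Node of a centered interval tree over closed (start, end, payload) tuples."""

    def __init__(self, intervals):
        # Center on the median endpoint, so at least one interval contains it
        points = sorted(p for s, e, _ in intervals for p in (s, e))
        self.center = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        # Intervals containing the center, sorted both ways for early exit
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query(self, start, end):
        """Return all stored intervals overlapping the closed range [start, end]."""
        hits = []
        if end < self.center:
            # Query lies left of center: keep intervals that start by `end`
            for iv in self.by_start:
                if iv[0] > end:
                    break
                hits.append(iv)
        elif start > self.center:
            # Query lies right of center: keep intervals that end at/after `start`
            for iv in self.by_end:
                if iv[1] < start:
                    break
                hits.append(iv)
        else:
            # Query contains the center, so every interval stored here overlaps
            hits.extend(self.by_start)
        if self.left and start < self.center:
            hits.extend(self.left.query(start, end))
        if self.right and end > self.center:
            hits.extend(self.right.query(start, end))
        return hits


# Exons from data02 as (start, last, gene_id)
exons = [
    (5262, 5496, "scaffold_200003.1"),
    (5579, 5750, "scaffold_200003.1"),
    (5856, 6032, "scaffold_200003.1"),
    (6115, 6198, "scaffold_200003.1"),
    (916, 1201, "scaffold_200001.1"),
    (614, 789, "scaffold_200001.1"),
    (171, 435, "scaffold_200001.1"),
    (2677, 2806, "scaffold_200002.1"),
    (2899, 3125, "scaffold_200002.1"),
]
tree = IntervalNode(exons)

def annotate(start, end):
    """Comma-joined gene_ids of overlapping exons, or 'NA' if there are none."""
    genes = sorted(tree.query(start, end))
    return ",".join(gene for _, _, gene in genes) or "NA"

for start, end in [(5207, 5867), (155667, 155670), (525, 775)]:
    print(start, end, annotate(start, end))
```

The tree is built once from data02 and each data01 range is then looked up against it, which avoids comparing every pair of rows.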
Source: https://stackoverflow.com/questions/43475370/how-to-merge-two-pandas-dataframes-or-transfer-values-by-comparing-ranges-of-v