Merging DataFrames on multiple conditions - not specifically on equal values

混江龙づ霸主 提交于 2019-11-29 04:26:46

I've just thought of a way to solve this - by combining my two methods:

First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_chr_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes      = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp      = this_chr_snp['BP'].min()
    max_bp      = this_chr_snp['BP'].max()
    this_genes  = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
            ~(this_genes['chr_stop'] <= min_bp)]

    for line in this_genes.iterrows():
        info     = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp    = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps  = pd.concat(all_dfs)

While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.

You can use the following to accomplish what you're looking for:

merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]

Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:

snp_df
Out[193]: 
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972   30400
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

gene_df
Out[194]: 
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

merged_df
Out[195]: 
         SNP        feature_id
8  rs3131972  GeneID:100302278
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!