Merging DataFrames on multiple conditions - not specifically on equal values

梦如初夏 2020-12-17 00:47

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.

I am trying to join (merge) t

2 answers
  • 2020-12-17 01:35

    I've just thought of a way to solve this - by combining my two methods:

    First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This doesn't need any SQL queries. I've also included a step that immediately discards redundant genes, meaning those with no SNPs falling within their range. This uses a double for-loop, which I normally try to avoid, but in this case it works quite well.

    import pandas as pd

    all_dfs = []

    for chromosome in snp_df['chromosome'].unique():
        this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
        this_genes   = gene_df.loc[gene_df['chromosome'] == chromosome]

        # Drop genes that cannot contain any SNP on this chromosome.
        # The bounds are inclusive so that they agree with the test below.
        min_bp = this_chr_snp['BP'].min()
        max_bp = this_chr_snp['BP'].max()
        this_genes = this_genes.loc[(this_genes['chr_start'] <= max_bp) &
                                    (this_genes['chr_stop'] >= min_bp)]

        for _, info in this_genes.iterrows():
            # SNPs whose position falls inside this gene (inclusive bounds)
            this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                        (this_chr_snp['BP'] <= info['chr_stop'])]
            if not this_snp.empty:
                # assign() returns a new frame, so the slice is not modified in place
                all_dfs.append(this_snp[['SNP']].assign(feature_id=info['feature_id']))

    # Note: pd.concat raises ValueError if no SNP fell inside any gene
    all_genic_snps = pd.concat(all_dfs)
    

    While this isn't spectacularly fast, it does finish, so I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
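
    One possible way to speed this up, sketched here under assumptions rather than as a tested solution for the full data: keep the per-chromosome loop, but replace the inner loop over genes with a single NumPy broadcast comparison. The function name `genic_snps` is made up for illustration, and the small dataframes are sample data using the column names from the code above. The comparison builds a boolean matrix of size n_snps x n_genes per chromosome, so it assumes that matrix fits in memory.

```python
import numpy as np
import pandas as pd


def genic_snps(snp_df, gene_df):
    """Pair each SNP with every gene whose [chr_start, chr_stop] range contains it."""
    pieces = []
    for chromosome, snps in snp_df.groupby('chromosome'):
        genes = gene_df.loc[gene_df['chromosome'] == chromosome]
        if genes.empty:
            continue
        bp = snps['BP'].to_numpy()[:, None]              # shape (n_snps, 1)
        start = genes['chr_start'].to_numpy()[None, :]   # shape (1, n_genes)
        stop = genes['chr_stop'].to_numpy()[None, :]     # shape (1, n_genes)
        # Boolean matrix: entry [i, j] is True when SNP i lies inside gene j
        snp_idx, gene_idx = np.nonzero((bp >= start) & (bp <= stop))
        pieces.append(pd.DataFrame({
            'SNP': snps['SNP'].to_numpy()[snp_idx],
            'feature_id': genes['feature_id'].to_numpy()[gene_idx],
        }))
    if not pieces:
        # No chromosome had both SNPs and genes
        return pd.DataFrame(columns=['SNP', 'feature_id'])
    return pd.concat(pieces, ignore_index=True)


# Small sample data (illustrative only)
snp_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'SNP': ['rs3094315', 'rs3131972', 'rs2073814', 'rs3115859', 'rs3131956'],
    'BP': [752566, 30400, 753474, 754503, 758144],
})
gene_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'chr_start': [10954, 12190, 14362, 30366, 34611],
    'chr_stop': [11507, 13639, 29370, 30503, 36081],
    'feature_id': ['GeneID:100506145', 'GeneID:100652771', 'GeneID:653635',
                   'GeneID:100302278', 'GeneID:645520'],
})

result = genic_snps(snp_df, gene_df)
print(result)
```

    If the per-chromosome matrix is too large, the SNPs could be processed in smaller batches inside the loop; the logic stays the same.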

  • 2020-12-17 01:37

    You can use the following to accomplish what you're looking for:

    merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
    merged_df = merged_df[(merged_df['BP'] >= merged_df['chr_start']) &
                          (merged_df['BP'] <= merged_df['chr_stop'])][['SNP', 'feature_id']]
    

    Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:

    snp_df
    Out[193]: 
       chromosome        SNP      BP
    0           1  rs3094315  752566
    1           1  rs3131972   30400
    2           1  rs2073814  753474
    3           1  rs3115859  754503
    4           1  rs3131956  758144
    
    gene_df
    Out[194]: 
       chromosome  chr_start  chr_stop        feature_id
    0           1      10954     11507  GeneID:100506145
    1           1      12190     13639  GeneID:100652771
    2           1      14362     29370     GeneID:653635
    3           1      30366     30503  GeneID:100302278
    4           1      34611     36081     GeneID:645520
    
    merged_df
    Out[195]: 
             SNP        feature_id
    8  rs3131972  GeneID:100302278
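
    For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the modified dataframes shown above and applies the same merge-then-filter steps. The intermediate names `pairs` and `in_gene` are introduced only for illustration.

```python
import pandas as pd

# The modified example dataframes from the output above
snp_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'SNP': ['rs3094315', 'rs3131972', 'rs2073814', 'rs3115859', 'rs3131956'],
    'BP': [752566, 30400, 753474, 754503, 758144],
})
gene_df = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 1],
    'chr_start': [10954, 12190, 14362, 30366, 34611],
    'chr_stop': [11507, 13639, 29370, 30503, 36081],
    'feature_id': ['GeneID:100506145', 'GeneID:100652771', 'GeneID:653635',
                   'GeneID:100302278', 'GeneID:645520'],
})

# Step 1: equality join on chromosome gives every SNP-gene pair per chromosome
pairs = snp_df.merge(gene_df, on='chromosome', how='inner')

# Step 2: keep only the pairs where the SNP position lies inside the gene range
in_gene = (pairs['BP'] >= pairs['chr_start']) & (pairs['BP'] <= pairs['chr_stop'])
merged_df = pairs.loc[in_gene, ['SNP', 'feature_id']]

print(len(pairs))   # number of intermediate pairs before filtering
print(merged_df)
```

    Note that the intermediate frame holds every SNP-gene combination on each chromosome (5 x 5 rows here) before the filter is applied, so on large inputs it can become much bigger than either input.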
    