Merging DataFrames on multiple conditions - not specifically on equal values

梦如初夏 2020-12-17 00:47

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have already tried.

I am trying to join (merge) t

2 Answers
  •  -上瘾入骨i
    2020-12-17 01:35

    I've just thought of a way to solve this by combining my two methods.

    First, split the data by chromosome, then loop through the genes in each of these smaller DataFrames. This doesn't need any SQL queries. I've also included a step that immediately discards redundant genes, i.e. genes with no SNPs falling within their range. It uses a double for-loop, which I normally try to avoid, but in this case it works quite well.

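    import pandas as pd

    all_dfs = []

    for chromosome in snp_df['chromosome'].unique():
        this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
        this_genes   = gene_df.loc[gene_df['chromosome'] == chromosome]

        # Drop genes that cannot contain any SNP on this chromosome, i.e.
        # genes lying entirely outside [min_bp, max_bp]. The bounds are
        # inclusive, to match the inclusive range test in the inner loop.
        min_bp     = this_chr_snp['BP'].min()
        max_bp     = this_chr_snp['BP'].max()
        this_genes = this_genes.loc[(this_genes['chr_start'] <= max_bp) &
                                    (this_genes['chr_stop'] >= min_bp)]

        for _, info in this_genes.iterrows():
            # SNPs whose position falls within this gene's range
            this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                        (this_chr_snp['BP'] <= info['chr_stop'])]
            if not this_snp.empty:
                # assign() returns a new frame, so no slice is modified in place
                this_snp = this_snp[['SNP']].assign(feature_id=info['feature_id'])
                all_dfs.append(this_snp)

    # pd.concat raises ValueError on an empty list, so guard against no matches
    all_genic_snps = (pd.concat(all_dfs) if all_dfs
                      else pd.DataFrame(columns=['SNP', 'feature_id']))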
    

    While this doesn't run spectacularly quickly, it does run, so I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently, though.
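
One possible way to avoid the inner Python loop, offered as a rough sketch rather than a tested replacement: merge the two frames on `chromosome` and then filter the resulting pairs by the range condition. The toy data and the function name `genic_snps_by_merge` are made up for illustration; only the column names come from the code above. The intermediate merge holds every SNP-gene pair on each chromosome, so on large inputs it may need to be run one chromosome at a time to keep memory in check.

```python
import pandas as pd

# Toy data (hypothetical values) using the same column names as above.
snp_df = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "chromosome": [1, 1, 2, 2],
    "BP": [150, 900, 300, 5000],
})
gene_df = pd.DataFrame({
    "feature_id": ["geneA", "geneB", "geneC"],
    "chromosome": [1, 1, 2],
    "chr_start": [100, 800, 250],
    "chr_stop": [200, 850, 350],
})


def genic_snps_by_merge(snp_df, gene_df):
    """Pair each SNP with every gene whose [chr_start, chr_stop] contains it."""
    # Equi-join on chromosome: each SNP meets every gene on its chromosome.
    pairs = snp_df.merge(gene_df, on="chromosome", how="inner")
    # Keep only the pairs where the SNP position lies in the gene's range
    # (inclusive at both ends, as in the loop version).
    in_range = (pairs["BP"] >= pairs["chr_start"]) & (pairs["BP"] <= pairs["chr_stop"])
    return pairs.loc[in_range, ["SNP", "feature_id"]].reset_index(drop=True)


all_genic_snps = genic_snps_by_merge(snp_df, gene_df)
print(all_genic_snps)
```

With the toy data, only `rs1` (inside `geneA`) and `rs3` (inside `geneC`) survive the filter. Whether this is actually faster than the loop will depend on how many genes and SNPs share each chromosome.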
