Merging DataFrames on multiple conditions - not specifically on equal values

前端未结

关注

 2  983

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.

I am trying to join (merge) t

相关标签:

2条回答

-上瘾入骨i

2020-12-17 01:35

I've just thought of a way to solve this - by combining my two methods:

First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_chr_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes      = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp      = this_chr_snp['BP'].min()
    max_bp      = this_chr_snp['BP'].max()
    this_genes  = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
            ~(this_genes['chr_stop'] <= min_bp)]

    for line in this_genes.iterrows():
        info     = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp    = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps  = pd.concat(all_dfs)

While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.

0 讨论(0)

执笔经年

2020-12-17 01:37

You can use the following to accomplish what you're looking for:

merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]

Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:

snp_df
Out[193]: 
   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972   30400
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

gene_df
Out[194]: 
   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

merged_df
Out[195]: 
         SNP        feature_id
8  rs3131972  GeneID:100302278

0 讨论(0)