merging files based on column coordinates of two files in python

问题

I have a file called snp.txt that looks like this:

chrom   chromStart  chromEnd    name    strand     observed     
chr1    259         260      rs72477211  +   A/G    single  
chr1    433         433      rs56289060  +   -/C    insertion   
chr1    491         492      rs55998931  +   C/T    single  
chr1    518         519      rs62636508  +   C/G    single  
chr1    582         583      rs58108140  +   A/G    single

I have a second file gene.txt

chrom   chromStart  chromEnd    tf_title    tf_score
chr1    200         270         NFKB1       123
chr1    420         440         IRF4        234
chr1    488         550         BCL3        231
chr1    513         579         TCF12       12
chr1    582         583         BAD170      89

The final output I want is: output.txt

chrom   chromStart  chromEnd    name    strand  observed    tf_title    tf_score
chr1    259         260      rs72477211    +    A/G         NFKB1       123
chr1    433         433      rs56289060    +    -/C         IRF4        234
chr1    491         492      rs55998931    +    C/T         BCL3        231
chr1    518         519      rs62636508    +    C/G         TCF12       12
chr1    582         583      rs58108140    +    A/G         BAD170      89

The key thing I want to be able to do is to look at gene.txt and check if the rsnumber in the name column of snp.txt is in the same region established by chrom, chromStart and chromEnd.

For example:

In the first row of snp.txt the rsid rs72477211 is on chr1 between positions 259 and 260.

Now in gene.txt, NFKB1 is also on chr1 but between positions 200 and 270, this means that rsid rs72477211 is located the NFKB1 region, so this is noted in output txt.

I am unable to do this in using pandas merge function and I'm not sure where to even start. the files are extremely large so a loop would be highly inefficient. Can someone please help? Thanks!

回答1:

If it fits in memory, you can merge the two dataframes with an outer method base only on chrom column, then filter your result by doing the range inclusion math:

df = snp.merge(gene, how='outer', on='chrom')
df = df[(df.chromStart_x>=df.chromStart_y) & (df.chromEnd_x<=df.chromEnd_y)]

You can eventually remove the duplicate columns:

del test['chromStart_y']
del test['chromEnd_y']

来源：https://stackoverflow.com/questions/17457142/merging-files-based-on-column-coordinates-of-two-files-in-python

标签

python

merge

pandas