Python - comparing columns in 2 files and returning merged output

Question

I have a seemingly simple problem but have been stuck on it for too long now. I would like to compare two files (format shown below):

> file1
20  246057  0.28    68363   0   A
20  246058  0.28    68396   T   C
20  246059  0.28    76700   A   G
20  246060  0.28    76771   T   C
20  246061  0.28    76915   0   A

> file2
112879285   R   68303   20
200068921   M   68319   20
200257910   K   68336   20
200192457   W   68363   20
138777928   Y   68396   20

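I want to compare file1 columns 0 and 3 with file2 columns 3 and 2 respectively (file1 column 0 against file2 column 3, and file1 column 3 against file2 column 2). If they match, I want to output the rest of the information for the matching rows from both files, as follows: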

> desired output
20  246057  0.28    68363   0   A   200192457   W
20  246058  0.28    68396   T   C   138777928   Y

This is the code I have so far. I have tried several variations of it, and many of the suggestions on here, but I'm still stuck on how to get the corresponding info from file1. Most of the things I try result in the last line of file1 being repeated for every match.

#!/usr/bin/python
import csv

data2 = []
output = open("output.txt","w")

with open("file1.txt", "rb") as in_file1, open("file2.txt","rb") as in_file2:
    reader1 = csv.reader((in_file1), delimiter="\t")
    for row1 in reader1:
        y1 = row1[0], row1[3]
        data2.append(tuple(y1))
        y = row1
    reader2 = csv.reader((in_file2), delimiter="\t")
    for row2 in reader2:
        z = row2[-1], row2[2]
        if tuple(z) in data2:
            out = "\t".join(row2)
            output.write(out+"\n")

The part I am struggling with is getting the output from file1, after parsing. So I am currently ending up with the result below, but I also want the corresponding info for these rows from file1:

> current output
200192457   W   68363   20
138777928   Y   68396   20

Any help or suggestions are greatly appreciated! Thank you! (I am using python 2.7)

Answer 1:

Here's a solution I wrote from scratch:

f1 = file("file1.txt")
f2 = file("file2.txt")
d = {}
while True:
  line = f1.readline()
  if not line:
    break
  c0,c1,c2,c3,c4,c5 = line.split()
  d[(c0,c3)] = (c0,c1,c2,c3,c4,c5)
while True:
  line = f2.readline()
  if not line:
    break
  c0,c1,c2,c3 = line.split()
  if (c3,c2) in d:
    vals = d[(c3,c2)]
    print c3,vals[1],vals[2],vals[3],vals[4],vals[5],c0,c1

It reads the first file, and stores the values into a dict using tuple keys. Then it reads the second file, and checks if the tuple key exists in the dictionary. If so, it prints all of the data.

Note that you have to remember to close the files too in a final working version of the program. For brevity, I left out the lines that close them.
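For Python 3, the same dict-based idea could be sketched roughly as below. The function name `merge_on_keys` and the in-memory sample lines are illustrative assumptions, not part of the answer above; any iterable of lines works, including an open file object.

```python
def merge_on_keys(lines1, lines2):
    """Yield merged rows where (col 0, col 3) of lines1 equals (col 3, col 2) of lines2."""
    # Index the first input by its (column 0, column 3) key.
    index = {}
    for line in lines1:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        index[(fields[0], fields[3])] = fields

    # Stream the second input and look each (column 3, column 2) key up in the index.
    for line in lines2:
        fields = line.split()
        if len(fields) < 4:
            continue
        match = index.get((fields[3], fields[2]))
        if match is not None:
            # Full file1 row followed by the first two columns of the file2 row.
            yield "\t".join(match + fields[:2])


# Hypothetical sample data taken from the rows shown in the question.
file1 = [
    "20\t246057\t0.28\t68363\t0\tA",
    "20\t246058\t0.28\t68396\tT\tC",
    "20\t246059\t0.28\t76700\tA\tG",
]
file2 = [
    "112879285\tR\t68303\t20",
    "200192457\tW\t68363\t20",
    "138777928\tY\t68396\t20",
]

merged = list(merge_on_keys(file1, file2))
for row in merged:
    print(row)
```

With real files, calling the function inside `with open("file1.txt") as f1, open("file2.txt") as f2:` closes both files automatically. As in the answer above, if a key occurs more than once in file1, only the last row with that key is kept.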

Answer 2:

That's a nice use case for join, awk, and cut:

$ join -11 -24 file1 file2 | awk '$4 == $9 { print }' | cut -d' ' -f1-8

Output:

20 246057 0.28 68363 0 A 200192457 W
20 246058 0.28 68396 T C 138777928 Y

Explanation:

Join the two files file1 and file2 on the first field of file1 (-11, i.e. -1 1) and the fourth field of file2 (-24, i.e. -2 4). join expects both inputs to be sorted on the join field, which holds here because every row has the value 20.
Filter only those lines where the 4th and the 9th field are equal ($4 == $9); print these lines ({ print }).
From these lines print only the 1st to 8th field (-f1-8).
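For readers who want to stay in Python, the three stages described above could be mirrored roughly as follows. This is only a sketch: the function name `join_filter_cut` and the sample rows are assumptions for illustration, and fields are compared as strings.

```python
from collections import defaultdict


def join_filter_cut(rows1, rows2):
    # Stage 1 (join): group file2 rows by their 4th field, then pair every
    # file1 row with each file2 row that has the same value in that field.
    by_key = defaultdict(list)
    for r2 in rows2:
        by_key[r2[3]].append(r2)

    result = []
    for r1 in rows1:
        for r2 in by_key.get(r1[0], []):
            # Like join's output: join field first, then the remaining
            # fields of file1, then the remaining fields of file2.
            joined = r1 + r2[:3]
            # Stage 2 (awk): keep lines whose 4th and 9th fields agree.
            if joined[3] == joined[8]:
                # Stage 3 (cut): keep only fields 1 to 8.
                result.append(joined[:8])
    return result


# Hypothetical sample rows taken from the question.
rows1 = [
    ["20", "246057", "0.28", "68363", "0", "A"],
    ["20", "246058", "0.28", "68396", "T", "C"],
    ["20", "246059", "0.28", "76700", "A", "G"],
]
rows2 = [
    ["112879285", "R", "68303", "20"],
    ["200192457", "W", "68363", "20"],
    ["138777928", "Y", "68396", "20"],
]

result = join_filter_cut(rows1, rows2)
for fields in result:
    print(" ".join(fields))
```

Like the shell pipeline, this first pairs every row that shares the join field and only then filters, so it does more work than a direct lookup on both columns when many rows share the same first field.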

Answer 3:

Try modifying your code as follows. You need to write row1 at the point where its match in file2 is found, while it is still in scope:

with open("file1.txt", "rb") as in_file1, open("file2.txt","rb") as in_file2:
    reader1 = csv.reader(in_file1, delimiter="\t")
    # Read file2 into a list once; a file handle cannot be iterated a second time
    rows2 = list(csv.reader(in_file2, delimiter="\t"))
    for row1 in reader1:
        y1 = row1[0], row1[3]
        for row2 in rows2:
            z = row2[-1], row2[2]
            if z == y1:
                # Append file2's first two columns to the matching file1 row
                out = "\t".join(row1 + row2[:2])
                output.write(out+"\n")
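One thing to watch with a nested loop over two open files: a file object can only be read through once, so a second `csv.reader` built on a handle that has already been consumed yields no rows. A small sketch of that behaviour, using `io.StringIO` as a stand-in for file2.txt (an assumption made purely for illustration):

```python
import csv
import io

# io.StringIO stands in for an open file handle on file2.txt.
handle = io.StringIO("200192457\tW\t68363\t20\n138777928\tY\t68396\t20\n")

# The first reader consumes the handle up to end-of-file.
first_pass = list(csv.reader(handle, delimiter="\t"))

# A new reader on the same handle starts at end-of-file and finds nothing.
second_pass = list(csv.reader(handle, delimiter="\t"))

# Rewinding the handle makes the rows available again.
handle.seek(0)
third_pass = list(csv.reader(handle, delimiter="\t"))

print(len(first_pass), len(second_pass), len(third_pass))
```

Reading the inner file into a list once, or calling `seek(0)` before each inner pass, avoids silently comparing only the first row of the outer file.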

Source: https://stackoverflow.com/questions/29278265/python-comparing-columns-in-2-files-and-returning-merged-output

Tags

python

loops

for-loop

multiple-columns

string-matching