Question
I have a seemingly simple problem but have been stuck on it for too long now. I would like to compare two files (format shown below):
> file1
20 246057 0.28 68363 0 A
20 246058 0.28 68396 T C
20 246059 0.28 76700 A G
20 246060 0.28 76771 T C
20 246061 0.28 76915 0 A
> file2
112879285 R 68303 20
200068921 M 68319 20
200257910 K 68336 20
200192457 W 68363 20
138777928 Y 68396 20
I want to compare columns 0 and 3 of file1 with columns 3 and 2 of file2 (zero-based: file1 column 0 against file2 column 3, and file1 column 3 against file2 column 2). When both match, I want to output the file1 row followed by the remaining file2 columns, as follows:
> desired output
20 246057 0.28 68363 0 A 200192457 W
20 246058 0.28 68396 T C 138777928 Y
This is the code I have so far. I have tried several variations of it and many of the suggestions on here, but I'm still stuck on how to get the corresponding info from file1. Most of the things I try result in the last line of file1 being repeated for every match.
#!/usr/bin/python
import csv
data2 = []
output = open("output.txt","w")
with open("file1.txt", "rb") as in_file1, open("file2.txt","rb") as in_file2:
    reader1 = csv.reader((in_file1), delimiter="\t")
    for row1 in reader1:
        y1 = row1[0], row1[3]
        data2.append(tuple(y1))
        y = row1
    reader2 = csv.reader((in_file2), delimiter="\t")
    for row2 in reader2:
        z = row2[-1], row2[2]
        if tuple(z) in data2:
            out = "\t".join(row2)
            output.write(out+"\n")
The part I am struggling with is getting the output from file1, after parsing. So I am currently ending up with the result below, but I also want the corresponding info for these rows from file1:
> current output
200192457 W 68363 20
138777928 Y 68396 20
Any help or suggestions are greatly appreciated! Thank you! (I am using python 2.7)
Answer 1:
Here's a solution I wrote from scratch:
f1 = file("file1.txt")
f2 = file("file2.txt")
d = {}
while True:
    line = f1.readline()
    if not line:
        break
    c0,c1,c2,c3,c4,c5 = line.split()
    d[(c0,c3)] = (c0,c1,c2,c3,c4,c5)
while True:
    line = f2.readline()
    if not line:
        break
    c0,c1,c2,c3 = line.split()
    if (c3,c2) in d:
        vals = d[(c3,c2)]
        print c3,vals[1],vals[2],vals[3],vals[4],vals[5],c0,c1
It reads the first file and stores the values in a dict using tuple keys. Then it reads the second file and checks whether the tuple key exists in the dictionary. If so, it prints all of the data.
Note that a final working version of the program should also close the files. For brevity, I left out the lines that close them.
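The dictionary lookup described above can also be written so that the files are closed automatically. This is a minimal sketch, assuming Python 3 (the question uses 2.7) and whitespace-separated columns. The function name merge_rows is hypothetical, and io.StringIO stands in for the real files so the example runs on its own.

```python
import io

def merge_rows(file1, file2):
    """Yield file1 rows merged with the first two columns of matching file2 rows.

    A match means file1 (col 0, col 3) == file2 (col 3, col 2), zero-based.
    """
    index = {}
    for line in file1:
        fields = line.split()
        if len(fields) >= 4:  # skip blank or short lines
            # If a key repeats in file1, the last row wins.
            index[(fields[0], fields[3])] = fields
    for line in file2:
        fields = line.split()
        if len(fields) >= 4 and (fields[3], fields[2]) in index:
            yield index[(fields[3], fields[2])] + fields[:2]

# Sample rows from the question; io.StringIO stands in for open("file1.txt").
FILE1 = """\
20 246057 0.28 68363 0 A
20 246058 0.28 68396 T C
20 246059 0.28 76700 A G
20 246060 0.28 76771 T C
20 246061 0.28 76915 0 A
"""
FILE2 = """\
112879285 R 68303 20
200068921 M 68319 20
200257910 K 68336 20
200192457 W 68363 20
138777928 Y 68396 20
"""

# The with statement closes both handles on exit, even if an error occurs.
with io.StringIO(FILE1) as f1, io.StringIO(FILE2) as f2:
    for row in merge_rows(f1, f2):
        print(" ".join(row))
```

With real data, replacing the two io.StringIO(...) calls with open("file1.txt") and open("file2.txt") should be enough. Each lookup in the dictionary takes roughly constant time, so file2 is scanned only once.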
Answer 2:
That's a nice use case for join, awk, and cut:
$ join -11 -24 file1 file2 | awk '$4 == $9 { print }' | cut -d' ' -f1-8
Output:
20 246057 0.28 68363 0 A 200192457 W
20 246058 0.28 68396 T C 138777928 Y
Explanation:
- Join the two files file1 and file2 on the first field of file1 (-11) and the fourth field of file2 (-24).
- Filter only those lines where the 4th and the 9th field are equal ($4 == $9); print these lines ({ print }).
- From these lines print only the 1st to 8th field (-f1-8).
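To make the field numbering in the three stages easier to follow, here is a rough Python 3 sketch that mimics them on the sample rows from the question. The function name join_filter_cut is hypothetical. Unlike the real join utility, which expects its inputs sorted on the join field, this sketch simply tries every pair of rows.

```python
from itertools import product

def join_filter_cut(rows1, rows2):
    """Mimic the join, filter, and cut stages on lists of fields."""
    result = []
    for r1, r2 in product(rows1, rows2):
        # Join stage: field 1 of file1 must equal field 4 of file2.
        if r1[0] != r2[3]:
            continue
        # join emits the key, then the rest of file1, then the rest of file2.
        joined = [r1[0]] + r1[1:] + r2[:3]
        # Filter stage: 1-based fields 4 and 9 are indexes 3 and 8 here.
        if joined[3] == joined[8]:
            # Cut stage: keep 1-based fields 1 through 8.
            result.append(joined[:8])
    return result

# Sample rows from the question, already split into fields.
rows1 = [line.split() for line in [
    "20 246057 0.28 68363 0 A",
    "20 246058 0.28 68396 T C",
    "20 246059 0.28 76700 A G",
    "20 246060 0.28 76771 T C",
    "20 246061 0.28 76915 0 A",
]]
rows2 = [line.split() for line in [
    "112879285 R 68303 20",
    "200068921 M 68319 20",
    "200257910 K 68336 20",
    "200192457 W 68363 20",
    "138777928 Y 68396 20",
]]

for row in join_filter_cut(rows1, rows2):
    print(" ".join(row))
```

Each joined row has nine fields: six from file1 (the key first) and the three non-key fields from file2. That is why the comparison is between fields 4 and 9, and why cutting to fields 1 through 8 drops the duplicated value.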
Answer 3:
Try modifying your code as follows; you need to write out the row1 for which a match is found in file2:
with open("file1.txt", "rb") as in_file1, open("file2.txt","rb") as in_file2:
    reader1 = csv.reader((in_file1), delimiter="\t")
    for row1 in reader1:
        y1 = row1[0], row1[3]
        # Rewind file2 so that every row of file1 scans it from the start;
        # without this, file2 is exhausted after the first row of file1.
        in_file2.seek(0)
        reader2 = csv.reader((in_file2), delimiter="\t")
        for row2 in reader2:
            z = row2[-1], row2[2]
            if tuple(z) in [tuple(y1)]:
                out = "\t".join(row1)
                output.write(out+"\n")
                out = "\t".join(row2)
                output.write(out+"\n")
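One thing to watch with a nested loop over two open files is that a file object is exhausted after a single pass, so the inner file has to be rewound or held in memory. Below is a minimal sketch of the in-memory variant, assuming Python 3 and tab-separated input. The function name nested_match is hypothetical, and io.StringIO stands in for the real files.

```python
import csv
import io

def nested_match(file1, file2):
    """Return file1 rows merged with the first two columns of matching file2 rows."""
    reader1 = csv.reader(file1, delimiter="\t")
    # Read file2 into a list once, because a reader can only be consumed one time.
    rows2 = list(csv.reader(file2, delimiter="\t"))
    merged = []
    for row1 in reader1:
        for row2 in rows2:
            # file1 (col 0, col 3) must equal file2 (last col, col 2).
            if (row1[0], row1[3]) == (row2[-1], row2[2]):
                merged.append(row1 + row2[:2])
    return merged

# Sample rows from the question, tab-separated as the csv delimiter implies.
FILE1 = (
    "20\t246057\t0.28\t68363\t0\tA\n"
    "20\t246058\t0.28\t68396\tT\tC\n"
    "20\t246059\t0.28\t76700\tA\tG\n"
)
FILE2 = (
    "112879285\tR\t68303\t20\n"
    "200192457\tW\t68363\t20\n"
    "138777928\tY\t68396\t20\n"
)

for row in nested_match(io.StringIO(FILE1), io.StringIO(FILE2)):
    print("\t".join(row))
```

This compares every row of file1 with every row of file2, so the running time grows with the product of the two file sizes. For large inputs a dictionary keyed on the two columns is likely the better choice.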
Source: https://stackoverflow.com/questions/29278265/python-comparing-columns-in-2-files-and-returning-merged-output