Question
I have a CSV file in Excel which contains the output from a BLAST search in the following format:
# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 1 hits found
Cryptocephalus ctg7180000094003 79.59 637 110 9 38 655 1300 1935 1.00E-125 444
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 4 hits found
Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051
Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780
Cryptocephalus ctg7180000094015 81.26 667 105 13 7 654 1269 1934 2.00E-152 532
Cryptocephalus ctg7180000093818 78.64 515 106 4 8 519 1270 1783 2.00E-94 340
I have imported this as a CSV into Python using:

    import csv

    with open('BLASToutput.csv', 'rU') as csvfile:
        contents = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in contents:
            table = ', '.join(row)

What I now want to be able to do is extract columns of data as a list. My overall aim is to count all the matches which have over 98% identity (the third column).
The issue is that, since this is not in the typical CSV format, there are no headers at the top, so I can't extract a column based on its header. I was thinking that if I could extract the third column as a list, I could then use normal list tools in Python to extract just the numbers I want, but I have never used Python's csv module and I'm struggling to find an appropriate command. Other questions on SO are similar but don't cover my specific case, where there are no headers and there are empty cells. If you could help me I would be very grateful!
Answer 1:
The data file is not really in CSV format. It contains comment lines, and its delimiter is not a single character but runs of spaces used to align the columns.
Since your overall aim is "to count all the matches which have over 98% identity (the third column)", and the data file content is well formed, you can use an ordinary line-by-line parsing approach:
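
    import re

    with open('BLASToutput.csv') as f:
        # read the file line by line
        for line in f:
            # skip comment lines (or print them here if you want to keep them)
            if line.startswith('#'):
                continue
            # split the line into fields on runs of whitespace
            fields = re.split(r'\s+', line.strip())
            # skip blank or short lines that have no third field
            if len(fields) < 3:
                continue
            # check whether the 3rd field (% identity) is greater than 98
            if float(fields[2]) > 98:
                # output the matching line
                print(line.rstrip())
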
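Since the stated goal is a count rather than a printout, the same loop can be wrapped in a small function that returns the number of matching hits. This is only a minimal sketch, assuming Python 3 and whitespace-separated hit lines like those in the sample; the function name `count_hits_above` and its `threshold` parameter are hypothetical and not part of the original answer.

```python
def count_hits_above(lines, threshold=98.0):
    """Count hit lines whose % identity (3rd field) exceeds threshold."""
    count = 0
    for line in lines:
        line = line.strip()
        # skip blank lines and '#' comment lines
        if not line or line.startswith('#'):
            continue
        fields = line.split()  # splits on any run of whitespace
        if float(fields[2]) > threshold:
            count += 1
    return count


# two hit lines copied from the sample in the question
sample = [
    "# BLASTN 2.2.29+",
    "# 4 hits found",
    "Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051",
    "Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780",
]
print(count_hits_above(sample, threshold=90))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `count_hits_above(open('BLASToutput.csv'))`.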
Answer 2:
I managed to find one way based on:
Python: split files using mutliple split delimiters

    import csv

    with open("SANDoubleSuperMatrix.csv", "r") as csvfile:
        # let the Sniffer guess the delimiter from the first 1024 characters
        dialect = csv.Sniffer().sniff(csvfile.read(1024))
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        # collect the third column of every row
        identity = []
        for line in reader:
            identity.append(line[2])

    print(identity)

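Note that `csv.reader` yields every field as a string and does not skip `#` comment lines, so the values would still need converting before a numeric comparison. A hedged sketch of pulling out one column as a list while ignoring comment lines is shown below. It assumes Python 3 and whitespace-separated fields, and `extract_column` with its `index` and `convert` parameters are illustrative names, not something from the original answer.

```python
def extract_column(lines, index, convert=float):
    """Return column `index` (0-based) of all non-comment lines as a list."""
    values = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # ignore blank and comment lines
        fields = line.split()  # split on any run of whitespace
        values.append(convert(fields[index]))
    return values


# lines copied from the sample in the question
sample = [
    "# Query: Cryptocephalus aureolus",
    "Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051",
    "Cryptocephalus ctg7180000093818 78.64 515 106 4 8 519 1270 1783 2.00E-94 340",
]
identity = extract_column(sample, 2)
print(identity)
# count the values above 98 with ordinary list tools
print(sum(1 for value in identity if value > 98))
```

Passing `convert=str` keeps the raw text, which is useful for non-numeric columns such as the subject id.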
Source: https://stackoverflow.com/questions/23404971/extracting-blast-output-columns-in-csv-form-with-python