Extracting BLAST output columns in CSV form with Python

Question

I have a CSV file, opened in Excel, that contains the output of a BLAST search in the following format:

# BLASTN 2.2.29+                                            
# Query: Cryptocephalus androgyne                                           
# Database: SANdouble                                           
# Fields: query id   subject id  % identity  alignment length    mismatches  gap opens   q. start    q. end  s. start    s. end  evalue  bit score
# 1 hits found                                          
Cryptocephalus  ctg7180000094003    79.59   637 110 9   38  655 1300    1935    1.00E-125   444
# BLASTN 2.2.29+                                            
# Query: Cryptocephalus aureolus                                            
# Database: SANdouble                                           
# Fields: query id   subject id  % identity  alignment length    mismatches  gap opens   q. start    q. end  s. start    s. end  evalue  bit score
# 4 hits found                                          
Cryptocephalus  ctg7180000093816    95.5    667 12  8   7   655 1269    1935    0   1051
Cryptocephalus  ctg7180000094021    88.01   667 62  8   7   655 1269    1935    0   780
Cryptocephalus  ctg7180000094015    81.26   667 105 13  7   654 1269    1934    2.00E-152   532
Cryptocephalus  ctg7180000093818    78.64   515 106 4   8   519 1270    1783    2.00E-94    340

I have imported this as a CSV into Python using:

with open('BLASToutput.csv', 'rU') as csvfile:
    contents = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in contents:
        table = ', '.join(row)

What I now want to be able to do is extract columns of data as a list. My overall aim is to count all the matches which have over 98% identity (the third column).

The issue is that, since this is not in the typical CSV format, there are no headers at the top, so I can't extract a column based on its header. I was thinking that if I could extract the third column as a list, I could then use normal list tools in Python to pull out just the numbers I want, but I have never used Python's csv module and I'm struggling to find an appropriate command. Other questions on SO are similar but don't cover my specific case, where there are no headers and there are empty cells. If you could help me I would be very grateful!

Answer 1:

The data file is not really in CSV format. It has comment lines, and its delimiter is not a single character but runs of whitespace used to align the columns.

Since your overall aim is

to count all the matches which have over 98% identity (the third column).

and the data file content is well formed, you can use a normal file-parsing approach:

with open('BLASToutput.csv') as f:
    # read the file line by line
    for line in f:
        # skip blank lines and comment lines
        if not line.strip() or line.startswith('#'):
            continue
        # split fields on any run of whitespace (spaces or tabs)
        fields = line.split()
        # check if the 3rd field (% identity) is greater than 98
        if float(fields[2]) > 98:
            # output the matched line
            print(line.rstrip())
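
Since the stated aim is a count rather than a printout, the same loop can be wrapped in a small function that returns the number of matching hits. This is only a sketch: the function name count_identity_above and its threshold and column parameters are made up for illustration, and it assumes that no field contains embedded whitespace, as in the sample rows from the question.

```python
def count_identity_above(lines, threshold=98.0, column=2):
    """Count hit rows whose value in `column` is greater than `threshold`.

    `lines` can be any iterable of strings, e.g. an open file object.
    Assumes fields are separated by whitespace and contain none themselves.
    """
    count = 0
    for line in lines:
        line = line.strip()
        # ignore blank lines and the '#' comment lines BLAST writes
        if not line or line.startswith('#'):
            continue
        fields = line.split()
        if float(fields[column]) > threshold:
            count += 1
    return count


# A few rows copied from the question, used here as in-memory test data
sample = [
    "# BLASTN 2.2.29+",
    "# 4 hits found",
    "Cryptocephalus  ctg7180000093816    95.5    667 12  8   7   655 1269    1935    0   1051",
    "Cryptocephalus  ctg7180000094021    88.01   667 62  8   7   655 1269    1935    0   780",
    "",
    "Cryptocephalus  ctg7180000094015    81.26   667 105 13  7   654 1269    1934    2.00E-152   532",
]

print(count_identity_above(sample))                  # default threshold of 98
print(count_identity_above(sample, threshold=90.0))  # lower threshold for comparison
```

With a real file you would pass the open file object instead of the list, e.g. count_identity_above(open('BLASToutput.csv')).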

Answer 2:

I managed to find one way based on:

Python: split files using mutliple split delimiters

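import csv

with open("SANDoubleSuperMatrix.csv", newline="") as csvfile:
    # let the Sniffer guess the delimiter from the first 1024 characters
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)

    identity = []

    for line in reader:
        # skip '#' comment rows and rows too short to have an identity column
        if len(line) < 3 or line[0].startswith("#"):
            continue
        identity.append(line[2])

print(identity)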
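
If the delimiter is already known, the sniffing step can be skipped by filtering out the comment lines before they reach csv.reader, which accepts any iterable of strings. The sketch below assumes the file is tab-delimited (BLAST's tabular output normally is, but an Excel export may use commas, so check your file); read_column is a hypothetical helper name, and the in-memory demo data stands in for the real file.

```python
import csv
import io


def read_column(handle, index, delimiter="\t"):
    """Return the values of column `index` as floats, skipping '#' and blank lines."""
    data_lines = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    reader = csv.reader(data_lines, delimiter=delimiter)
    return [float(row[index]) for row in reader]


# Two hit rows from the question, rewritten with tabs (an assumption about the file)
demo = io.StringIO(
    "# BLASTN 2.2.29+\n"
    "# 4 hits found\n"
    "Cryptocephalus\tctg7180000093816\t95.5\t667\t12\t8\t7\t655\t1269\t1935\t0\t1051\n"
    "Cryptocephalus\tctg7180000094021\t88.01\t667\t62\t8\t7\t655\t1269\t1935\t0\t780\n"
)

identities = read_column(demo, 2)
over_98 = sum(1 for value in identities if value > 98)
print(identities, over_98)
```

Converting to float up front means the resulting list can be filtered or counted with ordinary list tools, which is what the question asked for.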

Source: https://stackoverflow.com/questions/23404971/extracting-blast-output-columns-in-csv-form-with-python

Tags

python

excel

csv

blast