SeqIO.parse on a fasta.gz

Submitted by …衆ロ難τιáo~ on 2019-12-10 01:58:51

Question


New to coding. New to Python/Biopython; this is my first question online, ever. How do I open a compressed fasta.gz file so that I can extract information and perform calculations in my function? Here is a simplified example of what I'm trying to do (I've tried different ways) and the error I get. The gzip call I'm using doesn't seem to work.

with gzip.open("practicezip.fasta.gz", "r") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)

Traceback (most recent call last):

  File "<ipython-input-192-a94ad3309a16>", line 2, in <module>
    for record in SeqIO.parse(handle, "fasta"):

  File "C:\Users\Anaconda3\lib\site-packages\Bio\SeqIO\__init__.py", line 600, in parse
    for r in i:

  File "C:\Users\Anaconda3\lib\site-packages\Bio\SeqIO\FastaIO.py", line 122, in FastaIterator
    for title, sequence in SimpleFastaParser(handle):

  File "C:\Users\Anaconda3\lib\site-packages\Bio\SeqIO\FastaIO.py", line 46, in SimpleFastaParser
    if line[0] == ">":

IndexError: index out of range

Answer 1:


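Are you using Python 3?

In Python 3, gzip.open with mode "r" opens the file in binary mode and yields bytes, whereas the FASTA parser expects text. Changing the mode from "r" to "rt" (text mode) should solve your problem.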

import gzip
from Bio import SeqIO

with gzip.open("practicezip.fasta.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)
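
To see why the mode matters, here is a minimal sketch that uses only the standard library. The file name and the FASTA record are made up for the demonstration. It writes a tiny gzipped file and reads the first line back in both modes; the comparison against ">" mirrors the check shown in the traceback.

```python
import gzip
import os
import tempfile

# Write a tiny gzipped FASTA file to a temporary directory (made-up demo data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(">seq1\nACGT\n")

# For gzip.open, mode "r" means binary: each line is a bytes object.
with gzip.open(path, "r") as handle:
    binary_line = handle.readline()

# Mode "rt" means text: each line is a str.
with gzip.open(path, "rt") as handle:
    text_line = handle.readline()

# Indexing bytes gives an int, so it never equals the string ">".
print(type(binary_line).__name__, binary_line[0] == ">")  # bytes False
print(type(text_line).__name__, text_line[0] == ">")      # str True
```

Because the binary-mode header line never compares equal to ">", a parser that scans for it runs to the end of the file, which is consistent with the IndexError in the question.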



Answer 2:


Here is a solution if you want to handle both regular text and gzipped files:

import gzip
from mimetypes import guess_type
from functools import partial

from Bio import SeqIO

input_file = 'input_file.fa.gz'

encoding = guess_type(input_file)[1]  # uses file extension
if encoding is None:
    _open = open
elif encoding == 'gzip':
    _open = partial(gzip.open, mode='rt')
else:
    raise ValueError('Unknown file encoding: "{}"'.format(encoding))

with _open(input_file) as f:
    for record in SeqIO.parse(f, 'fasta'):
        print(record)

NOTE: this relies on the file having the correct extension, which I think is a reasonable assumption nearly all of the time (and the errors are obvious and explicit when it is not met). However, it is also possible to check the file content itself rather than rely on this assumption.
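
One way to check the content rather than the extension is to look at the first two bytes of the file: every gzip stream starts with 0x1f 0x8b. The sketch below is only an illustration of that idea, not the method the answer links to; the helper name open_maybe_gzipped is hypothetical.

```python
import gzip

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzipped(path):
    """Hypothetical helper: open `path` as text, decompressing if it is gzipped.

    The decision is based on the file's first two bytes, not on its extension.
    """
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")
```

The returned handle can then be passed to SeqIO.parse(handle, 'fasta') inside a with block, in the same way as _open(input_file) above.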



Source: https://stackoverflow.com/questions/42757283/seqio-parse-on-a-fasta-gz
