BioPython AlignIO ValueError says strings must be same length?

北城余情 提交于 2021-02-10 11:28:17

问题


Input fasta-format text file:

http://www.jcvi.org/cgi-bin/tigrfams/DownloadFile.cgi?file=/opt/www/www_tmp/tigrfams/fa_alignment_PF00205.txt

#!/usr/bin/python

from Bio import AlignIO

seq_file = open('/path/to/fa_alignment_PF00205.txt')
alignment = AlignIO.read(seq_file, "fasta")

Error:

ValueError: Sequences must all be the same length

The input sequences shouldn't have to be the same length since on ClustalOmega you can align sequences of differing lengths.

This also doesn't work...gets the same error:

alignment = AlignIO.parse(seq_file,"fasta")

for record in alignment:
    print(record.id)

Does anybody who is familiar with BioPython know how to get around this to align sequences from fasta files?


回答1:


Pad the sequence that is too short and write the records to to a temporary FASTA file. Than your alignments works as expected:

from Bio import AlignIO
from Bio import SeqIO
from Bio import Seq
import os

input_file = '/path/to/fa_alignment_PF00205.txt'
records = SeqIO.parse(input_file, 'fasta')
records = list(records) # make a copy, otherwise our generator
                        # is exhausted after calculating maxlen
maxlen = max(len(record.seq) for record in records)

# pad sequences so that they all have the same length
for record in records:
    if len(record.seq) != maxlen:
        sequence = str(record.seq).ljust(maxlen, '.')
        record.seq = Seq.Seq(sequence)
assert all(len(record.seq) == maxlen for record in records)

# write to temporary file and do alignment
output_file = '{}_padded.fasta'.format(os.path.splitext(input_file)[0])
with open(output_file, 'w') as f:
    SeqIO.write(records, f, 'fasta')
alignment = AlignIO.read(output_file, "fasta")
print alignment

This outputs:

SingleLetterAlphabet() alignment with 104 rows and 275 columns
TKAAIELIADHQ.......LTVLADLLVHRLQ..AVKELEALLA...QAL SP|A2VGF0.1/208-339
LQELASVINQHE...KV..MLFCGHGCR...Y..AVEEVMALAK...EDL SP|A3D4X6.1/190-319
IKKIAQAIEKAK...KP..VICAGGGVINS.N..ASEELLTLSR...KEL SP|A3DID9.1/192-327
IDEAAEAINKAE...RP..VILAGGGVSIA.G..ANKELFEFAT...QLL SP|A3DIY4.1/192-327
IEKAIELINSSQ...RP..FICSGGGVISS.E..ASEELIQFAE...KIL SP|A4XHS0.1/191-326
IKRAVEAIENSQ...RP..VICSGGGVIAS.R..ASDELKILVE...SEI SP|A4XIL5.1/194-328
VRQAARIIMESE...RP..VIYAGGGVRIS.G..AAPELLELSE...RAL SP|A5D4V9.1/192-327
LQALAQRILRAQ...RP..VIITGDEIVKS.D..ALQAAADFAS...LQL SP|A5ECG1.1/192-328
VEKAVELLWSAR...RV..LVISGRGAR...G..AGPELIGLLD...RAM SP|A5EDH4.1/198-324
IQKAARLIETAE...KP..VIIAGHGVNIS.G..ANEELKTLAE...KSL SP|A5FR34.1/193-328
LDALARDLDSAA...RV..TIYAGIGAR...G..AAARVVQLAG...EAL SP|A5FTR0.1/189-317
VADVAALLRAAR...RP..VIVAGGGVIHSG...AEERLATFAA...DAL SP|A5G0X6.1/217-351
IAEAVSALKGAK...RP..IIYTGGGLINS.GPESAELIVQLAK...RAL SP|A5G2E1.1/199-336
LKKAAEIINRAK...RP..LIYAGGGITLA.G..ASAELRALAA...ALL SP|A5GC69.1/192-327
CRDIVGKLLQSH...RP..VVLGGTGVRLS.R..TEQRLLALVE...DVF SP|A5W0I1.1/200-336
LDQAALKLAAAE...RP..MIIAGGGA..L.H..AAEQLAQLSA...AGL SP|A5W220.1/196-326
LQRAADILNTGH...KV..AILVGAGAL...Q..ATEQVIAIAE...RAL SP|A5W364.1/198-328
IRKAAEMLLAAK...RP..VVYSGGGVILG.G..GSEALTEIAK...SEM SP|A5W954.1/196-331
...
LTELQERLANAQ...RP..VVILGGSRWSD.A..AVQQFTRFAE...... SP|Q220C3.1/190-328



回答2:


your problem is last record of fasta ... tail -9 fa_alignment_PF00205.txt

>SP|Q21VK8.1/229-357
LQAALAALAKAE...RP..LLVIGSQALVLSK..QAEHLAEAVARL.GIPV.YLSGMA..RGLLG.R..........DH.
...............PLQ..................MRHQRRQALRE..ADCVLLAG.VP...CDFRLD......YGKHV
RR..............S.AT.........L..IAA.N......................RSA.........KDARLNR..
.......K...PD.IAAIGDAG.......LFLQAL
>SP|Q220C3.1/190-328
LTELQERLANAQ...RP..VVILGGSRWSD.A..AVQQFTRFAEAF.SLPV.FCSFRR..QMLFS.A..........NH.
...............ACY...AG.DLGLG.A.....NQRLLARI.RQ..SDLILLLG.GR...MSEVPS......QGYEL
LGIPAPQQ...........D

Sequence with id SP|Q220C3.1/190-328 has different length than other sequences



来源:https://stackoverflow.com/questions/32833230/biopython-alignio-valueerror-says-strings-must-be-same-length

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!