Question
I have a program that takes a pair of very large multi-sequence files (more than 77,000 sequences each, averaging about 1,000 bp long), calculates the alignment score for each pair of corresponding sequences, and writes that score to an output file (which I will load into Excel later).
My code works for small multi-sequence files, but my large master file throws the following traceback after analyzing the 16th pair.
Traceback (most recent call last):
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 109, in <module>
    cycle(f,k,binLen)
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 85, in cycle
    a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 301, in __call__
    return _align(**keywds)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 322, in _align
    score_only)
MemoryError: Out of memory
I have tried many things to work around this (as you may see from the code), all to no avail. I have tried splitting the large master file into smaller batches to feed into the score-calculating method, using del on files after I am done with them, and running under Ubuntu 11.11 in an Oracle virtual machine (I typically work in 64-bit Windows 7). Am I being too ambitious? Is this computationally feasible in Biopython? My code is below. I have no experience debugging memory problems, and memory is clearly the culprit here. Any assistance is greatly appreciated; I am becoming very frustrated with this problem.
Best, Harry
##Open reference file
##a.)Upload subjectList
##b.)Upload query list (a and b are pairwise data)
## Cycle through each paired FASTA and get alignment score of each(Large file)
from Bio import SeqIO
from Bio import pairwise2
import gc
##BATCH ITERATOR METHOD (not my code)
def batch_iterator(iterator, batch_size) :
    entry = True #Make sure we loop once
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = iterator.next()
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch
def split(subject,query):
    ##Query Iterator and Batch Subject Iterator
    query_iterator = SeqIO.parse(query,"fasta")
    record_iter = SeqIO.parse(subject,"fasta")
    ##Writes both large file into many small files
    print "Splitting Subject File..."
    binLen=2
    for j, batch1 in enumerate(batch_iterator(record_iter, binLen)) :
        filename1="groupA_%i.fasta" % (j+1)
        handle1=open(filename1, "w")
        count1 = SeqIO.write(batch1, handle1, "fasta")
        handle1.close()
    print "Done splitting Subject file"
    print "Splitting Query File..."
    for k, batch2 in enumerate(batch_iterator(query_iterator,binLen)):
        filename2="groupB_%i.fasta" % (k+1)
        handle2=open(filename2, "w")
        count2 = SeqIO.write(batch2, handle2, "fasta")
        handle2.close()
    print "Done splitting both FASTA files"
    print " "
    return [k ,binLen]
##This file will hold the alignment scores in a tab deliminated text
f = open("C:\\Users\\Harry\\Documents\\cgigas\\alignScore.txt", 'w')
def cycle(f,k,binLen):
    i=1
    m=1
    while i<=k+1:
        ##Open the first small file
        subjectFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupA_" + str(i)+".fasta", "rU")
        queryFile =open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupB_" + str(i)+".fasta", "rU")
        i=i+1
        j=0
        ##Make small file iterators
        smallQuery=SeqIO.parse(queryFile,"fasta")
        smallSubject=SeqIO.parse(subjectFile,"fasta")
        ##Cycles through both sets of FASTA files
        while j<binLen:
            j=j+1
            currentQuery=smallQuery.next()
            currentSubject=smallSubject.next()
            ##Verify every pair is correct
            print " "
            print "Pair: " + str(m)
            print "Subject: "+ currentSubject.id
            print "Query: " + currentQuery.id
            gc.collect()
            a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
            gc.collect()
            currentQuery=None
            currentSubject=None
            score=str(a)
            a=None
            print "Score: " + score
            f.write("1"+ "\n")
            m=m+1
        smallQuery.close()
        smallSubject.close()
        subjectFile.close()
        queryFile.close()
        gc.collect()
        print "New file"
##MAIN PROGRAM
##Here is our paired list of FASTA files
##subject = open("C:\\Users\\Harry\\Documents\\cgigas\\subjectFASTA.fasta", "rU")
##query =open("C:\\Users\\Harry\\Documents\\cgigas\\queryFASTA.fasta", "rU")
##[k,binLen]=split(subject,query)
k=272
binLen=2
cycle(f,k,binLen)
P.S. Be kind; I am aware there are probably some goofy things in the code that I put there while trying to get around this problem.
Answer 1:
See also this very similar question on BioStars: http://www.biostars.org/post/show/45893/trying-to-get-around-memoryerror-out-of-memory-exception-in-biopython-program/
There I suggested trying existing tools for this kind of thing, e.g. EMBOSS needleall (http://emboss.open-bio.org/wiki/Appdoc:Needleall). You can parse the EMBOSS alignment output with Biopython.
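As a rough illustration of that suggestion, the sketch below builds a needleall command line from Python 3 and only runs it on request. The helper names, the output file name, and the gap values (10.0 and 0.5, which I believe are the EMBOSS defaults) are assumptions for illustration and not part of the original answer. Note also that needleall computes global alignments of every sequence in one file against every sequence in the other, which is not the same as the one-to-one local scoring in the question.

```python
import shutil
import subprocess


def build_needleall_command(subject_fasta, query_fasta, out_path,
                            gap_open=10.0, gap_extend=0.5):
    """Return the argument list for an EMBOSS needleall run (hypothetical helper)."""
    return [
        "needleall",
        "-asequence", subject_fasta,   # first input sequence set
        "-bsequence", query_fasta,     # second input sequence set
        "-gapopen", str(gap_open),     # assumed gap opening penalty
        "-gapextend", str(gap_extend), # assumed gap extension penalty
        "-outfile", out_path,          # where needleall writes its report
        "-auto",                       # do not prompt for missing options
    ]


def run_needleall(command):
    """Run the command, failing early if needleall is not on the PATH."""
    if shutil.which(command[0]) is None:
        raise RuntimeError("needleall was not found; is EMBOSS installed?")
    subprocess.run(command, check=True)


# File names taken from the question; the output name is made up.
command = build_needleall_command(
    "subjectFASTA.fasta", "queryFASTA.fasta", "needleall_output.txt")
print(" ".join(command))
```

Calling `run_needleall(command)` would then start the actual run, provided EMBOSS is installed and both input files exist.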
Answer 2:
The pairwise2 module was updated in a recent version of Biopython (1.68) to be faster and use less memory.
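If upgrading is not an option, a score-only calculation does not in principle need the whole alignment matrix in memory. The sketch below (Python 3, standard library only) is not Biopython's implementation; it is a minimal example assuming the localxx scheme of 1 per identical character with no mismatch or gap penalties, under which the best score should equal the length of the longest common subsequence. It keeps only two rows of the table and walks both inputs in lockstep, so the files never need to be split. The names `lcs_score` and `score_pairs` and the toy records are made up for illustration.

```python
import io


def lcs_score(a, b):
    """Score two strings with match=1, mismatch=0 and free gaps.

    Only two rows of the dynamic-programming table are kept, so memory
    grows with the shorter sequence instead of len(a) * len(b).
    """
    if len(b) > len(a):
        a, b = b, a  # use the shorter sequence for the row length
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def score_pairs(subjects, queries, out_handle, score_func=lcs_score):
    """Score (id, sequence) records pairwise and write tab-delimited lines."""
    count = 0
    for (subject_id, subject_seq), (query_id, query_seq) in zip(subjects, queries):
        score = score_func(subject_seq, query_seq)
        out_handle.write("%s\t%s\t%s\n" % (subject_id, query_id, score))
        count += 1
    return count


# Toy records standing in for the two paired FASTA files.
subjects = iter([("s1", "ACCGGT"), ("s2", "AAAA")])
queries = iter([("q1", "ACGT"), ("q2", "TTTT")])
out = io.StringIO()
n_pairs = score_pairs(subjects, queries, out)
print(out.getvalue())
```

With Biopython installed, the records could instead come from generators such as `((r.id, str(r.seq)) for r in SeqIO.parse(path, "fasta"))`, and `score_func` could wrap `pairwise2.align.localxx(a, b, score_only=True)` to use the library's own scoring.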
Source: https://stackoverflow.com/questions/10840467/biopython-how-do-i-stop-memoryerror-out-of-memory-exception