I am trying to parse a large fasta file and I am encountering out of memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program
Without having a great understanding of what you are doing, I would have written the code like this:
def readFastaEntry( fp ):
name = ""
while True:
line = name or f.readline()
if not line:
break
seq = []
while True:
name = f.readline()
if not name or name.startswith(">"):
break
else:
seq.append(name)
yield (line, "".join(seq))
This gathers up the data after a starting line up to the next starting line. Making seq an array means that you minimize the string joining until the last possible moment. Yielding a tuple makes more sense than a list.