Reading a file block by block using a specified delimiter in Python

放肆的年华 posted on 2019-12-01 18:36:50

A general solution here is to write a generator function that yields one group at a time. This way, only one group is held in memory at any moment.

def get_groups(seq, group_by):
    data = []
    for line in seq:
        # Here the `startswith()` logic can be replaced with other
        # condition(s) depending on the requirement.
        if line.startswith(group_by):
            if data:
                yield data
                data = []
        data.append(line)

    if data:
        yield data

with open('input.txt') as f:
    for i, group in enumerate(get_groups(f, ">"), start=1):
        print("Group #{}".format(i))
        print("".join(group))

Output:

Group #1
> header1 description
data data
data

Group #2
>header2 description
more data
data
data
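
The comment in get_groups notes that the startswith() test can be swapped for other conditions. A minimal sketch of that idea follows, assuming a hypothetical variant named get_groups_by that takes any callable as the "start of group" test. The INI-style sample data and the regular expression are illustrative only.

```python
import io
import re


def get_groups_by(seq, is_start):
    """Yield lists of lines; a new group begins wherever is_start(line) is truthy."""
    data = []
    for line in seq:
        # Emit the group collected so far when a new one starts.
        if is_start(line) and data:
            yield data
            data = []
        data.append(line)
    if data:
        yield data


# Hypothetical input: sections introduced by "[name]" lines instead of ">".
sample = io.StringIO("[a]\nx=1\ny=2\n[b]\nz=3\n")
is_section = re.compile(r"\[\w+\]").match  # returns a match object or None
groups = list(get_groups_by(sample, is_section))
for i, group in enumerate(groups, start=1):
    print("Group #{}".format(i))
    print("".join(group))
```

Passing the condition as a callable keeps the grouping loop unchanged while the delimiter rule varies per call.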

For FASTA files in general, I would recommend using the Biopython package.

One approach that I like is to use itertools.groupby together with a simple key function:

from itertools import groupby


def make_grouper():
    counter = 0
    def key(line):
        nonlocal counter
        if line.startswith('>'):
            counter += 1
        return counter
    return key
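
To see what this key function returns, here is a small sketch run on an in-memory list of lines. make_grouper is repeated so the snippet runs on its own, and the sample lines are made up for illustration.

```python
from itertools import groupby


def make_grouper():
    counter = 0

    def key(line):
        nonlocal counter
        if line.startswith('>'):
            counter += 1
        return counter
    return key


# Illustrative data; a real file object yields lines the same way.
lines = [">h1 first\n", "AAA\n", "CCC\n", ">h2 second\n", "GGG\n"]

# Every line gets the number of headers seen so far, so consecutive
# lines of one section share a key and groupby keeps them together.
key = make_grouper()
keys = [key(line) for line in lines]
print(keys)

# The counter is stateful, so each groupby pass needs a fresh grouper.
sections = {k: "".join(g) for k, g in groupby(lines, key=make_grouper())}
print(sections)
```

Any lines that appear before the first header receive the key 0 and therefore form a group of their own.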

Use it as:

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        fasta_section = ''.join(group)   # or list(group)

You need the join only if you have to handle the contents of a whole section as a single string. If you are only interested in reading the lines one by one you can simply do:

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        # parse >header description
        header, description = next(group)[1:].split(maxsplit=1)
        for line in group:
            # handle the contents of the section line by line
            pass

def read_blocks(file):
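    block = ''
    for line in file:
        # A header line starts a new block; emit the one collected so far.
        if line.startswith('>') and block:
            yield block
            block = ''
        block += line
    # Emit the final block, unless the file was empty.
    if block:
        yield block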


with open('input_file.fa') as f:
    for block in read_blocks(f):
        print(block)

This reads the file line by line and yields each block as soon as the next header line is reached. Because the generator is lazy, only the current block is held in memory, so large files are not a problem.
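
The generators above all assume the delimiter sits at the start of a line. If the delimiter is an arbitrary string, the same lazy idea can be applied to fixed-size chunks instead of lines. The following is only a sketch under that assumption: read_delimited, its chunk_size parameter, and the "\n---\n" separator are hypothetical names and values chosen for illustration.

```python
import io


def read_delimited(stream, delimiter, chunk_size=4096):
    """Yield the pieces of a text stream that are separated by delimiter."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # All pieces except the last are complete. The last one may be cut
        # off mid-record (or mid-delimiter), so it stays in the buffer.
        *complete, buffer = buffer.split(delimiter)
        yield from complete
    # Whatever remains after the last delimiter is the final piece.
    if buffer:
        yield buffer


# Small chunk size to show that delimiters split across chunks still work.
sample = io.StringIO("a\nb\n---\nc\n---\nd\n")
pieces = list(read_delimited(sample, "\n---\n", chunk_size=4))
print(pieces)
```

Unlike the line-based versions, this one drops the delimiter from the yielded pieces, and an empty piece is yielded when two delimiters are adjacent.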
