Split a FASTA file and rename on the basis of the first line

Question

I have a huge file with the following content:

filename: input.txt

>chr1
jdlfnhl
dh,ndh
dnh.

dhjl

>chr2
dhfl
dhl
dh;l

>chr3

shgl
sgl

>chr2_random
dgld

I need to split this file so that I get four separate files, as below:

file 1: chr1.fa

>chr1
jdlfnhl
dh,ndh
dnh.

dhjl

file 2: chr2.fa

>chr2
dhfl
dhl
dh;l

file 3: chr3.fa

>chr3

shgl
sgl

file 4: chr2_random.fa

>chr2_random
dgld

I tried csplit on Linux, but could not name the output files after the text immediately following ">".

csplit -z input.txt '/>/' '{*}'
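
One way to finish the csplit approach is a second pass that renames each piece after its header line. The following is a minimal sketch, assuming csplit's default output names (xx00, xx01, ...) and headers that are plain names such as chr1. rename_csplit_pieces is a hypothetical helper introduced only for illustration:

```python
import os
import re


def rename_csplit_pieces(directory=".", prefix="xx"):
    """Rename csplit pieces (xx00, xx01, ...) to <header>.fa."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        # Only touch files that look like csplit output: prefix + digits.
        if not re.fullmatch(re.escape(prefix) + r"\d+", name):
            continue
        path = os.path.join(directory, name)
        with open(path) as handle:
            first_line = handle.readline().strip()
        if not first_line.startswith(">"):
            continue  # not a FASTA piece, leave it alone
        # Assumption: the header is a plain name with no spaces or slashes.
        target = os.path.join(directory, first_line[1:] + ".fa")
        os.rename(path, target)
        renamed.append(target)
    return renamed
```

Running rename_csplit_pieces() in the directory where csplit wrote its output should leave chr1.fa, chr2.fa and so on in place of the xxNN files.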

Answer 1:

Since you indicate you're on a Linux box, 'awk' seems to be the right tool for the job.

USAGE:
./foo.awk your_input_file

foo.awk:

#!/usr/bin/awk -f

/^>chr/ {
    OUT=substr($0,2) ".fa"
}

OUT {
    print >OUT
}

You can also do that in one line:

awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input

Answer 2:

If you find yourself wanting to do anything more complicated with FASTA/FASTQ files, you should consider Biopython.

Here's a post about modifying and re-writing FASTQ files: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

And another about splitting up FASTA files: http://lists.open-bio.org/pipermail/biopython/2012-July/008102.html

Answer 3:

A slightly messy script, but it should work on a large file because it only reads one line at a time.

To run it, use python thescript.py input.txt (it also reads from stdin, e.g. cat input.txt | python thescript.py).

import sys
import fileinput

in_file = False

for line in fileinput.input():
    if line.startswith(">"):
        # Close current file
        if in_file:
            f.close()

        # Make new filename
        fname = line.rstrip().partition(">")[2]
        fname = "%s.fa" % fname

        # Open new file
        f = open(fname, "w")
        in_file = True

        # Write current line
        f.write(line)

    elif in_file:
        # Write line to currently open file
        f.write(line)

    else:
        # Something went wrong, no ">chr1" found yet
        sys.stderr.write("Line %r encountered, but no preceding > line found\n" % line)

# Close the last output file
if in_file:
    f.close()

Answer 4:

Your best bet would be to use the fastaexplode program from the exonerate suite:

$ fastaexplode -h
fastaexplode from exonerate version 2.2.0
Using glib version 2.30.2
Built on Jan 12 2012
Branch: unnamed branch

fastaexplode: Split a fasta file up into individual sequences
Guy St.C. Slater. guy@ebi.ac.uk. 2000-2003.

Synopsis:
--------
fastaexplode <path>

General Options:
---------------
-h --shorthelp [FALSE] <TRUE>
   --help [FALSE] 
-v --version [FALSE] 

Sequence Input Options:
----------------------
-f --fasta [mandatory]  <*** not set ***>
-d --directory [.] 

--

Answer 5:

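with open('input.txt') as f:
    # Read the whole file and split it into records at each '>'
    records = f.read().split('>')

# records[0] is whatever precedes the first '>' (normally empty)
for record in records[1:]:
    # The header line (without '>') becomes the file name
    file_name = record.split('\n')[0]
    with open(file_name + '.fa', 'w') as out:
        out.write('>' + record)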
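
The snippet above reads the whole file into memory, which may be a problem for the huge input described in the question. The following is a minimal streaming sketch of the same idea, holding only one line at a time. split_fasta and its out_dir parameter are hypothetical names, and the header is assumed to be a plain name usable as a file name:

```python
import os


def split_fasta(in_path, out_dir="."):
    """Write each record of in_path to <out_dir>/<header>.fa, line by line."""
    written = []
    out = None
    try:
        with open(in_path) as src:
            for line in src:
                if line.startswith(">"):
                    # New record: close the previous file, open the next one.
                    if out is not None:
                        out.close()
                    name = line[1:].strip()
                    path = os.path.join(out_dir, name + ".fa")
                    out = open(path, "w")
                    written.append(path)
                if out is not None:
                    # Header and sequence lines go to the current record's file.
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling split_fasta("input.txt") writes the pieces to the current directory and returns their paths. Lines that appear before the first header are skipped, and a repeated header would overwrite the earlier file.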

Answer 6:

If you specifically want to do this with Python, you can use this code:

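f2 = None
with open("input.txt", "r") as f:
    for line in f:
        if line.startswith(">"):
            # Header line: close the previous output and start a new file
            if f2 is not None:
                f2.close()
            f2 = open(line[1:].strip() + ".fa", "w")
        if f2 is not None:
            # Write the header and sequence lines to the current file
            f2.write(line)

if f2 is not None:
    f2.close()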

Answer 7:

Alternatively, you could use Biopython. Installing it in a virtualenv is easy:

virtualenv biopython_env
source biopython_env/bin/activate
pip install numpy
pip install biopython

And once this is done, splitting the fasta file is easy. Let's assume you have the path to the fasta file in the fasta_file variable:

from Bio import SeqIO

parser = SeqIO.parse(fasta_file, "fasta")
for entry in parser:
    SeqIO.write(entry, "{}.fa".format(entry.id), "fasta")

Note that the auto-numbered "{}" form of format works in Python 2.7 and later; Python 2.6 needs "{0}.fa" instead.

As for performance, I used this to split the human genome reference from the 1000 Genomes project in negligible time, but I don't know how it would work for larger files.
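
The file name above comes from entry.id, which for FASTA input is the first whitespace-separated word of the header line. Where Biopython isn't available, the same naming rule can be sketched in plain Python. fasta_filename is a hypothetical helper, not a Biopython function, and the replacement of unusual characters is an added assumption about what makes a safe file name:

```python
import re


def fasta_filename(header, suffix=".fa"):
    """Build a file name from a FASTA header line such as '>chr1 some text'."""
    # Keep only the first word after '>', mirroring how the record id is chosen.
    words = header.lstrip(">").split()
    record_id = words[0] if words else "unnamed"
    # Assumption: replace anything outside letters, digits, '.', '_' and '-'
    # so the result is safe to use as a file name.
    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", record_id)
    return safe_id + suffix
```

For the headers in the question this gives chr1.fa, chr2.fa, chr3.fa and chr2_random.fa. A header with a description, such as ">chr1 Homo sapiens", would still map to chr1.fa.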

Answer 8:

#!/usr/bin/perl -w
use strict;
use warnings;


my %hash =();
my $key = '';
open(F, "<", "input.txt") or die $!;
while(<F>){
    chomp;
    if($_ =~ /^(>.+)/){
        $key = $1;
    }else{
       push @{$hash{$key}}, $_ ;
    }   
}

foreach(keys %hash){
    my $key1 = $_;
    my $key2 ='';
    if($key1 =~ /^>(.+)/){
        $key2 = $1;
    }
    open(MYOUTPUT, ">", "$key2.fa") or die $!;
    print MYOUTPUT join("\n",$_,@{$hash{$_}}),"\n";
    close MYOUTPUT;
}

Source: https://stackoverflow.com/questions/11818495/split-a-fasta-file-and-rename-on-the-basis-of-first-line

Tags

python

Linux

split

fasta