Question
I have a huge file with the following content:
filename: input.txt
>chr1
jdlfnhl
dh,ndh
dnh.
dhjl
>chr2
dhfl
dhl
dh;l
>chr3
shgl
sgl
>chr2_random
dgld
I need to split this file so that I get four separate files, as below:
file 1: chr1.fa
>chr1
jdlfnhl
dh,ndh
dnh.
dhjl
file 2: chr2.fa
>chr2
dhfl
dhl
dh;l
file 3: chr3.fa
>chr3
shgl
sgl
file 4: chr2_random.fa
>chr2_random
dgld
I tried csplit on Linux, but could not name the resulting files after the text immediately following ">".
csplit -z input.txt '/>/' '{*}'
Answer 1:
Since you indicate you're on a Linux box, awk seems to be the right tool for the job.
Usage: ./foo.awk your_input_file
foo.awk:
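#!/usr/bin/awk -f
# A header line selects a new output file named after the text following ">"
/^>chr/ {
    OUT = substr($0, 2) ".fa"
}
# Once an output file is set, write every line (header included) to it
OUT {
    print > OUT
}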
You can also do this in one line:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input
Answer 2:
If you find yourself wanting to do anything more complicated with FASTA/FASTQ files, you should consider Biopython.
Here's a post about modifying and re-writing FASTQ files: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
And another about splitting up FASTA files: http://lists.open-bio.org/pipermail/biopython/2012-July/008102.html
Answer 3:
A slightly messy script, but it should work on a large file because it only reads one line at a time.
To run it, use python thescript.py input.txt
(or let it read from stdin, e.g. cat input.txt | python thescript.py)
import sys
import fileinput

in_file = False
for line in fileinput.input():
    if line.startswith(">"):
        # Close current file
        if in_file:
            f.close()
        # Make new filename
        fname = line.rstrip().partition(">")[2]
        fname = "%s.fa" % fname
        # Open new file
        f = open(fname, "w")
        in_file = True
        # Write current line
        f.write(line)
    elif in_file:
        # Write line to currently open file
        f.write(line)
    else:
        # Something went wrong, no ">chr1" found yet
        sys.stderr.write("Line %r encountered, but no preceding > line found\n" % line)

# Close the last output file
if in_file:
    f.close()
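The script above uses everything after ">" as the filename. That works for the headers in the question, but FASTA headers often carry a description after the ID or contain characters such as "/" that are awkward in filenames. A minimal sketch of one way to derive a safer name, assuming you only want the first whitespace-delimited token; the helper name header_to_filename and the replacement rule are illustrative choices, not part of the original answer:

```python
import re


def header_to_filename(header_line, suffix=".fa"):
    """Turn a FASTA header such as '>chr1 some text' into 'chr1.fa'."""
    # Drop the leading '>' and any surrounding whitespace.
    text = header_line.strip().lstrip(">").strip()
    # Keep only the first whitespace-delimited token (the sequence ID).
    parts = text.split()
    seq_id = parts[0] if parts else ""
    # Replace anything other than letters, digits, '.', '_' and '-' with '_'.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
    # Reject empty names and names made only of dots ('.', '..').
    if not safe.strip("."):
        raise ValueError("no usable ID in header: %r" % header_line)
    return safe + suffix
```

In the script, this would stand in for the two lines that build fname, e.g. fname = header_to_filename(line).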
Answer 4:
Your best bet would be to use the fastaexplode program from the exonerate suite:
$ fastaexplode -h
fastaexplode from exonerate version 2.2.0
Using glib version 2.30.2
Built on Jan 12 2012
Branch: unnamed branch
fastaexplode: Split a fasta file up into individual sequences
Guy St.C. Slater. guy@ebi.ac.uk. 2000-2003.
Synopsis:
--------
fastaexplode <path>
General Options:
---------------
-h --shorthelp [FALSE] <TRUE>
--help [FALSE]
-v --version [FALSE]
Sequence Input Options:
----------------------
-f --fasta [mandatory] <*** not set ***>
-d --directory [.]
--
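If you want to drive fastaexplode from a script, a rough sketch using only the -f and -d options listed in the help text above might look like this. The function names are made up for illustration, and how fastaexplode names its output files is left to the tool itself; that is not checked here:

```python
import shutil
import subprocess


def fastaexplode_command(fasta_path, out_dir="."):
    """Build the argument list, using the -f and -d options from the help text."""
    return ["fastaexplode", "-f", fasta_path, "-d", out_dir]


def run_fastaexplode(fasta_path, out_dir="."):
    """Run fastaexplode; assumes the exonerate suite is installed and on PATH."""
    if shutil.which("fastaexplode") is None:
        raise RuntimeError("fastaexplode not found on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(fastaexplode_command(fasta_path, out_dir), check=True)
```

For the file in the question, the call would be run_fastaexplode("input.txt").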
Answer 5:
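# Note: this reads the whole file into memory at once
with open('data.txt') as f:
    lines = f.read()

# Split on '>' and put the marker back at the start of each record
lines = lines.split('>')
lines = ['>' + x for x in lines[1:]]
for x in lines:
    file_name = x.split('\n')[0][1:]  # use this variable to create the new file
    fil = open(file_name + '.fa', 'w')
    fil.write(x)
    fil.close()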
Answer 6:
If you specifically want to do this in Python, you can use this code:
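f2 = None
with open("input.txt", "r") as f:
    for line in f:
        if line.startswith(">"):
            # Header line: close the previous output file and open the next one
            if f2 is not None:
                f2.close()
            f2 = open(line[1:].strip() + ".fa", "w")
        # Write every line, header included, to the current output file
        if f2 is not None:
            f2.write(line)
if f2 is not None:
    f2.close()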
Answer 7:
Alternatively, you could use Biopython. Installing it in a virtualenv is easy:
virtualenv biopython_env
source biopython_env/bin/activate
pip install numpy
pip install biopython
Once this is done, splitting the FASTA file is easy. Let's assume you have the path to the FASTA file in the fasta_file variable:
from Bio import SeqIO

parser = SeqIO.parse(fasta_file, "fasta")
for entry in parser:
    # entry.id is already e.g. "chr1", so no extra "chr" prefix is needed
    SeqIO.write(entry, "{}.fa".format(entry.id), "fasta")
Note that the empty {} placeholder in format works in Python 2.7 and later; older versions need an explicit index such as {0}.
As for performance, I used this to split the human genome reference from the 1000 Genomes project in negligible time, but I don't know how it would work for larger files.
Answer 8:
#!/usr/bin/perl -w
use strict;
use warnings;

my %hash = ();
my $key = '';
open(F, "<", "input.txt") or die $!;
while (<F>) {
    chomp;
    if ($_ =~ /^(>.+)/) {
        # Header line: remember it as the key for the lines that follow
        $key = $1;
    } else {
        push @{$hash{$key}}, $_;
    }
}
close F;

# Write one file per header, named after the text following ">"
foreach (keys %hash) {
    my $key1 = $_;
    my $key2 = '';
    if ($key1 =~ /^>(.+)/) {
        $key2 = $1;
    }
    open(MYOUTPUT, ">", "$key2.fa") or die $!;
    print MYOUTPUT join("\n", $_, @{$hash{$_}}), "\n";
    close MYOUTPUT;
}
Source: https://stackoverflow.com/questions/11818495/split-a-fasta-file-and-rename-on-the-basis-of-first-line