Remove line breaks in a FASTA file

Question

I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA

I'd like to convert it into this:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

I found a potential solution on this site, which looks like this:

cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta

However, this places an extra line break between each entry, so the file looks like this:

>accession1
ATGGCCCATGGGATCCTAGC

>accession2
GATATCCATGAAACGGCTTA

I'm an awk noob, but I took a shot at modifying the command. My guess was that if (p){print "\n";} was the culprit: potentially print "\n" is adding two line breaks. I couldn't figure out how to add just one newline. This is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) attempt:

awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta

However, this adds an empty line at the beginning of the file, because it always prints a newline before the first accession number:

{empty line} 
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Anyone have a solution to get my file in the correct format? Thanks!

Answer 1:

This awk program:

awk '!/^>/ { printf "%s", $0; n = "\n" }
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta

Will yield:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.

On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.

End with a newline, if required.

Note:

By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.

--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide
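
A quick way to check this behaviour is the minimal sketch below. The function name show_unset is made up for illustration; the variable n is deliberately never assigned, and it prints as an empty string and counts as 0 in arithmetic.

```shell
# Hypothetical helper: n is never assigned anywhere in this awk program.
show_unset() {
  # %s shows n in string context, %d shows n in numeric context.
  awk 'BEGIN { printf "[%s] %d\n", n, n + 0 }'
}

show_unset   # prints: [] 0
```

This is why the program above can print n before any sequence line has been seen: on the first header, n is still empty, so no leading blank line appears.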

Answer 2:

The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:

 awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file

Explanation:

For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.

Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:

awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file
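
Both versions depend on the difference between print, which appends the output record separator ORS (a newline by default), and printf, which writes only what it is given. That difference also appears to explain the blank lines produced by the command in the question: print "\n" emits the newline in the string and then ORS. A small sketch, where count_lines is a made-up helper name:

```shell
# Hypothetical helper: run one awk statement in a BEGIN block and
# count how many lines it writes.
count_lines() {
  awk "BEGIN { $1 }" | awk 'END { print NR }'
}

count_lines 'print "\n"'    # prints 2: the "\n" in the string plus ORS
count_lines 'printf "\n"'   # prints 1: printf adds nothing of its own
```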

Answer 3:

Here is another awk one-liner that should work for your case:

awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
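
The one-liner collects each sequence in the variable s and writes it out only when the next header (or the end of the input) is reached. Spelled out with comments, the same idea might look like the sketch below. The names input and join_buffered are made up for illustration, the sample data is the file from the question, and the ternary expression is rewritten as an explicit if, so this is a restructured equivalent rather than the exact command above.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' '>accession1' 'ATGGCCCATG' 'GGATCCTAGC' \
              '>accession2' 'GATATCCATG' 'AAACGGCTTA' > "$input"

# Hypothetical helper: join wrapped sequence lines by buffering them.
join_buffered() {
  awk '
    /^>/ {                         # header line
      if (s != "") print s         # flush the sequence collected so far
      print                        # then print the header itself
      s = ""                       # start a new, empty buffer
      next
    }
    { s = s $0 }                   # sequence line: append to the buffer
    END { if (s != "") print s }   # flush the last sequence
  ' "$1"
}

join_buffered "$input"
```

Because every line is written with print, no separate handling of the first header or the final newline is needed. The cost is that a whole sequence is held in memory before it is printed.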

Answer 4:

I would use sed for this. Using GNU sed:

sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file

Results:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

Create a label, a. If the current line is not the last line in the file, append the next line to the pattern space. If the pattern space doesn't start with the character >, perform the substitution s/\n\([^>]\)/\1/, which removes a newline that is followed by anything other than >. If a substitution has succeeded since the last input line was read, branch to label a. Otherwise, print the pattern space up to the first embedded newline. Then, if the pattern space contains no newline, start a normal new cycle as if the d command had been issued; if it does, delete the text in the pattern space up to the first newline and restart the cycle with the resulting pattern space, without reading a new line of input.
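
The steps above can be written as a commented script with one command per line. This is only a sketch: join_sed and input are made-up names, the sample data is the file from the question, and the commands are the same ones used in the one-liner, just separated by newlines instead of semicolons.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' '>accession1' 'ATGGCCCATG' 'GGATCCTAGC' \
              '>accession2' 'GATATCCATG' 'AAACGGCTTA' > "$input"

# Hypothetical helper: the sed program from this answer, one command per line.
join_sed() {
  sed '
    # Label a: the loop re-enters here after each successful join.
    :a
    # Unless this is the last input line, append the next line.
    $!N
    # If the pattern space is not a header, remove a newline that is
    # followed by a sequence character (anything other than ">").
    /^>/!s/\n\([^>]\)/\1/
    # A join happened: loop again to pull in the next line as well.
    ta
    # Print up to the first newline, delete that part, and restart the
    # cycle with what is left, without reading new input.
    P
    D
  ' "$1"
}

join_sed "$input"
```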

Answer 5:

Another variation :-)

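awk '!/>/{printf( "%s", $0);next}  # sequence line: print it without a newline
     NR>1{printf( "\n")}           # header after the first line: end the previous sequence
     END {printf"\n"}              # end of input: terminate the last sequence
     # 7 is an always-true pattern with no action, so header lines are printed as-is
     7' YourFile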

Answer 6:

You might be interested in bioawk, an adapted version of awk that is tuned to process FASTA files. Its fastx parser already returns each record's sequence as a single string in $seq, so no gsub call is needed:

bioawk -c fastx '{ print ">"$name; print $seq }' file.fasta

Note: bioawk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.

Source: https://stackoverflow.com/questions/15857088/remove-line-breaks-in-a-fasta-file

Tags

unix

awk

fasta