问题
I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:
>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA
I'd like to convert it into this:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
I found a potential solution on this site, which looks like this:
cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta
However, this places an extra line break between each entry, so file looks like this:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
I'm an awk noob, but I took a shot at modifying the command. My guess was the if (p){print "\n";} was the culprit...potentially print "\n" is adding two line breaks. I couldn't figure out how to add just one newline...this is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) solution:
awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta
However, this adds an empty line at the beginning of the file because it's always printing a new line before it prints the first accession number:
{empty line}
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Anyone have a solution to get my file in the correct format? Thanks!
回答1:
This awk program:
% awk '!/^>/ { printf "%s", $0; n = "\n" }
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta
Will yield:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Explanation:
On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.
On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.
End with a newline, if required.
Note:
By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in c and in most other traditional languages.
--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide
回答2:
The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:
awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file
Explanation:
For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.
Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:
awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file
回答3:
There is another awk one-liner, should work for your case.
awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
回答4:
I would use sed for this. Using GNU sed:
sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file
Results:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Explanation:
Create a label, a. If the line is not the last line in the file, append it to pattern space. If the line doesn't start with the character >, perform the substitution s/\n\([^>]\)/\1/. If the substitution was successful since the last input line was read, then branch to label a. Print up to the first embedded newline of the current pattern space. If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input.
回答5:
Another variation :-)
awk '!/>/{printf( "%s", $0);next}
NR>1{printf( "\n")}
END {printf"\n"}
7' YourFile
回答6:
You might be interested in bioawk, it is an adapted version of awk which is tuned to process fasta files
bioawk -c fastx '{ gsub(/\n/,"",seq); print ">"$name; print $seq }' file.fasta
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.
来源:https://stackoverflow.com/questions/15857088/remove-line-breaks-in-a-fasta-file