Remove line breaks in a FASTA file

混江龙づ霸主 提交于 2019-11-28 07:03:57

This awk program:

% awk '!/^>/ { printf "%s", $0; n = "\n" } 
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta

Will yield:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.

On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.

End with a newline, if required.

Note:

By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in , which is what you would do in and in most other traditional languages.

--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide

The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:

 awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file

Explanation:

For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.

Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:

awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file

There is another awk one-liner, should work for your case.

awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file

I would use sed for this. Using GNU sed:

sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file

Results:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

Create a label, a. If the line is not the last line in the file, append it to pattern space. If the line doesn't start with the character >, perform the substitution s/\n\([^>]\)/\1/. If the substitution was successful since the last input line was read, then branch to label a. Print up to the first embedded newline of the current pattern space. If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input.

Another variation :-)

awk '!/>/{printf( "%s", $0);next}
     NR>1{printf( "\n")} 
     END {printf"\n"}
     7' YourFile

You might be interested in bioawk, it is an adapted version of awk which is tuned to process fasta files

bioawk -c fastx '{ gsub(/\n/,"",seq); print ">"$name; print $seq }' file.fasta

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!