I have a FASTA file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:
>accession1
ATGGCC
Do not reinvent the wheel. If the goal is simply to remove newlines from a multi-line FASTA file (that is, to unwrap it), use one of the specialized bioinformatics tools, for example seqtk:
seqtk seq -l 0 input_file
Example:
# Create the input for testing:
cat > test_unwrap_in.fa <<EOF
>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT
ACGT
ACGT
>seq3 without blanks or newlines
ACGTACGTACGT
EOF
# Unwrap lines:
seqtk seq -l 0 test_unwrap_in.fa > test_unwrap_out.fa
cat test_unwrap_out.fa
Output:
>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGTACGTACGT
>seq3 without blanks or newlines
ACGTACGTACGT
To install seqtk, you can use, for example, conda install seqtk.
SEE ALSO:
seqtk usage:
seqtk seq
Usage: seqtk seq [options] <in.fq>|<in.fa>
Options: ...
-l INT number of residues per line; 0 for 2^32-1 [0]
This awk program:
% awk '!/^>/ { printf "%s", $0; n = "\n" }
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta
Will yield:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.
On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.
End with a newline, if required.
By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.
--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide
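For readers who find the deferred-newline trick easier to follow outside awk, here is a minimal Python sketch of the same three rules. The function name unwrap_fasta, the variable pending (standing in for awk's n), and the sample records are illustrative assumptions, not part of the awk program.

```python
import io


def unwrap_fasta(lines, out):
    """Write lines to out with each record's sequence joined onto one line."""
    pending = ""  # plays the role of awk's n: a newline owed after sequence text
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Header: emit the owed newline (if any), then the header itself.
            out.write(pending + line + "\n")
            pending = ""
        else:
            # Sequence chunk: write it without a line break, defer the newline.
            out.write(line)
            pending = "\n"
    out.write(pending)  # end with a newline, if required


# Hypothetical sample input, shaped like the question's file.
wrapped = io.StringIO(">accession1\nATGGCC\nCATGGG\n>accession2\nGATATC\nCATGAA\n")
result = io.StringIO()
unwrap_fasta(wrapped, result)
print(result.getvalue(), end="")
```

As in the awk version, nothing is buffered beyond the current line, so memory use does not grow with the length of a sequence.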
I would use sed for this. Using GNU sed:
sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file
Results:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Explanation:
:a creates a label, a.
$!N appends the next input line to pattern space, unless the current line is the last line in the file.
/^>/!s/\n\([^>]\)/\1/ applies only when pattern space doesn't start with the character >; it removes a newline that is followed by any character other than >.
ta branches to label a if a substitution has succeeded since the last input line was read.
P prints up to the first embedded newline of the current pattern space.
D starts a normal new cycle, as if the d command was issued, if pattern space contains no newline. Otherwise, it deletes text in pattern space up to the first newline and restarts the cycle with the resultant pattern space, without reading a new line of input.
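The core idea of the substitution (drop a newline only when it sits between two non-header lines) can also be sketched as a single regular expression in Python. This is a rough analogue under stated assumptions, not a translation of the sed commands: it reads the whole text into memory, and the names WRAPPED_NEWLINE and unwrap_text and the sample text are made up for illustration.

```python
import re

# A line that does not start with '>' followed by a newline, where the next
# line is neither a header nor empty. Only that newline is dropped.
WRAPPED_NEWLINE = re.compile(r"^([^>\n].*)\n(?=[^>\n])", re.MULTILINE)


def unwrap_text(text):
    """Join wrapped sequence lines in FASTA-formatted text."""
    return WRAPPED_NEWLINE.sub(r"\1", text)


# Hypothetical sample input.
example = ">accession1\nATGGCC\nCATGGG\nATCCTA\n>accession2\nGATATC\nCATGAA\n"
print(unwrap_text(example), end="")
```

The lookahead leaves the following line unconsumed, so each sequence line can in turn be matched and joined to the one after it.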
There have been great responses so far.
Here is an efficient way to do this in Python:
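def read_fasta(fasta):
    """Return (headers, sequences) with each sequence joined onto one line."""
    headers, sequences = [], []
    with open(fasta, 'r') as fast:
        for line in fast:
            if line.startswith('>'):
                # Drop only the leading '>' so any later '>' in the header survives.
                head = line[1:].strip()
                headers.append(head)
                sequences.append('')
            else:
                seq = line.strip()
                # Skip blank lines; append wrapped chunks to the current record.
                if len(seq) > 0:
                    sequences[-1] += seq
    return (headers, sequences)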
def write_fasta(headers, sequences, fasta):
    with open(fasta, 'w') as fast:
        for i in range(len(headers)):
            fast.write('>' + headers[i] + '\n' + sequences[i] + '\n')
You can use the above functions to retrieve sequences/headers from a fasta file without line breaks, manipulate them, and write back to a fasta file.
headers, sequences = read_fasta('input.fasta')
new_headers = do_something(headers)
new_sequences = do_something(sequences)
write_fasta(new_headers, new_sequences, 'input.fasta')
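If the file is too large to hold as two lists, one possible variation on the functions above is a generator that yields one record at a time. This is only a sketch: iter_fasta and unwrap are hypothetical names, the sample data is invented, and, as an assumption of this sketch, sequence lines that appear before any header are silently ignored.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle, one at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            # Non-blank sequence line belonging to the current record.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unwrap(in_handle, out_handle):
    """Write every record with its sequence on a single line."""
    for header, sequence in iter_fasta(in_handle):
        out_handle.write(">" + header + "\n" + sequence + "\n")


# Hypothetical in-memory input; real use would pass open file handles instead.
source = io.StringIO(">accession1\nATGGCC\nCATGGG\n>accession2\nGATATC\n")
target = io.StringIO()
unwrap(source, target)
print(target.getvalue(), end="")
```

With real files, the two handles would come from open() calls on an input path and a separate output path, so the input is never overwritten while it is still being read.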
Here is another awk one-liner that should work for your case:
awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
Another variation :-)
awk '!/>/{printf( "%s", $0);next}
NR>1{printf( "\n")}
END {printf"\n"}
7' YourFile