Remove line breaks in a FASTA file

予麋鹿 2020-12-05 01:26

I have a FASTA file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

>accession1
ATGGCC         


        
9 Answers
  •  隐瞒了意图╮
    2020-12-05 02:06

    Use this Perl one-liner, which handles the reformatting commonly needed in this and similar cases: it removes newlines and other whitespace within each sequence (which unwraps the sequence) and leaves the sequence header lines unchanged. Unlike some of the other answers, it also correctly handles leading and trailing whitespace and blank lines in the file:

    # Create the input for testing:
    
    cat > test_unwrap_in.fa <<EOF
    >seq1 with blanks
    ACGT ACGT ACGT
    >seq2 with newlines
    ACGT
    
    ACGT
    
    ACGT
    
    >seq3 without blanks or newlines
    ACGTACGTACGT
    
    EOF
    
    # Reformat with Perl:
    
    perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' test_unwrap_in.fa > test_unwrap_out.fa
    

    Output:

    >seq1 with blanks
    ACGTACGTACGT
    >seq2 with newlines
    ACGTACGTACGT
    >seq3 without blanks or newlines
    ACGTACGTACGT
    

    The Perl one-liner uses these command line flags:
    -e : Tells Perl to look for code in-line, instead of in a file.
    -n : Loop over the input one line at a time, assigning it to $_ by default.

    chomp : Remove the input line separator (\n on *NIX).
    if ( /^>/ ) : Test if the current line is a sequence header line.
    $n : This variable is undefined (false) until the first sequence header has been seen, and true afterwards. When it is true, an extra newline is printed before each header line; that newline terminates the preceding sequence.
    END { print "\n"; } : Print the final newline after the last sequence.
    s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
