Select sequences in a fasta file with more than 300 aa and “C” occurs at least 4 times

故事扮演 提交于 2019-12-13 14:28:10

问题


I have a fasta file which contains protein sequences. I'd like to select sequences with more than 300 amino acids and Cysteine (C) amino acid appears more than 4 times.

I've used this command to select sequences with more than 300 aa:

 cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }' 

Some sequence example:

  >jgi|Triasp1|216614|CE216613_3477
 MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
 NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
 YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
 AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
 YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
 MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
 QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ* 

回答1:


I do not know bioawk but I assume it is identical to awk with some initial parsing and constant definitions.

I would proceed as follows. Assuming you want the find the strings with more then 4 times the letter C in and a length of more than 300, then you could do :

bioawk -c fastx '
   (length($seq) > 300) && (gsub("C","C",$seq)>4) {
       print ">"$name; print $seq
   }' 72hDOWN-fasta.fasta

but this assumes that seq is the full character sequence.

The idea behind it is the following. The gsub command performs substitutions in strings and returns the total substitutions it did. Hence, if we substitute all characters "C" with "C" we actually did not change the string, but get the total amount of "C"'s in the string back.

From the POSIX standard IEEE Std 1003.1-2017:

gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.

sub(ere, repl[, in ]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in and return the number of substitutions. An <ampersand> ( & ) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that if repl is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. If in is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. If in is omitted, awk shall use the current record ($0) in its place.

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.



来源:https://stackoverflow.com/questions/50214546/select-sequences-in-a-fasta-file-with-more-than-300-aa-and-c-occurs-at-least-4

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!