“average length of the sequences in a fasta file”: Can you improve this Erlang code?

无人共我 2021-02-06 12:29

I'm trying to get the mean length of FASTA sequences using Erlang. A FASTA file looks like this:

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
AT

5 answers
  •  天命终不由人
    2021-02-06 13:08

    It looks like your big performance problems have been solved by opening the file in raw mode, but here are some more thoughts if you need to optimise that code further.

    Learn and use fprof.
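
    As a starting point, a profiling run with fprof can be wrapped in a small helper. This is only a sketch: the module name profile_demo, the function run/3 and the output file name "fprof.analysis" are made up for illustration, and the module and function you pass in would be your own.

```erlang
-module(profile_demo).
-export([run/3]).

%% Profile Module:Function(Args) with fprof and write a report to disk.
%% The report file name is an arbitrary choice for this sketch.
run(Module, Function, Args) ->
    %% Run the call under tracing; the trace goes to "fprof.trace" by default.
    fprof:apply(Module, Function, Args),
    %% Convert the raw trace into profiling data.
    fprof:profile(),
    %% Write the analysis (time per function, callers and callees).
    fprof:analyse([{dest, "fprof.analysis"}]).
```

    Calling profile_demo:run(lists, seq, [1, 100]) profiles a trivial call; substituting your own module, function and FASTA file name shows where the time actually goes.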

    You're using string:strip/1 primarily to remove the trailing newline. Because Erlang values are immutable, it has to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. If you know the file is well formed, just subtract one from your count; otherwise I'd try writing a length function that counts the number of relevant characters and ignores irrelevant ones.
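
    Such a length function might look like the sketch below. The names fasta_len and count_bases/1 are invented here, and it assumes that "irrelevant" means whitespace (newline, carriage return, space, tab); adjust the guard if your files contain other characters to skip.

```erlang
-module(fasta_len).
-export([count_bases/1]).

%% Count the characters in a line (a list of character codes),
%% skipping whitespace, without building a stripped copy of the line.
count_bases(Line) ->
    count_bases(Line, 0).

%% End of the line: return the accumulated count.
count_bases([], N) ->
    N;
%% Whitespace (assumed irrelevant): skip it without counting.
count_bases([C | Rest], N) when C =:= $\n; C =:= $\r; C =:= $\s; C =:= $\t ->
    count_bases(Rest, N);
%% Anything else counts as one sequence character.
count_bases([_ | Rest], N) ->
    count_bases(Rest, N + 1).
```

    The function walks the list once and allocates nothing, so it replaces both the strip and the length call.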

    I'm wary of advice that says binaries are better than lists, but given how little processing you do, it's probably the case here. The first steps are to open the file in binary mode and use erlang:size/1 to find the length.
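
    A binary-mode version could be sketched as follows. It is not your original code: the module name fasta_totals and its functions are hypothetical, it uses byte_size/1 (equivalent to erlang:size/1 for binaries), and it assumes a well-formed file in which header lines start with ">" and sequence lines end with a single "\n" and no other whitespace.

```erlang
-module(fasta_totals).
-export([totals/1]).

%% Return {TotalBases, NumberOfSequences} for the FASTA file at Path.
totals(Path) ->
    %% raw + binary + read_ahead: lines arrive as binaries, read in buffered chunks.
    {ok, Fd} = file:open(Path, [read, raw, binary, read_ahead]),
    try
        loop(Fd, 0, 0)
    after
        file:close(Fd)
    end.

loop(Fd, Total, Sequences) ->
    case file:read_line(Fd) of
        eof ->
            {Total, Sequences};
        %% A header line starts a new sequence.
        {ok, <<$>, _/binary>>} ->
            loop(Fd, Total, Sequences + 1);
        %% Any other line contributes its length to the running total.
        {ok, Line} ->
            loop(Fd, Total + line_length(Line), Sequences)
    end.

%% Length of a line without its trailing newline, if it has one.
line_length(Line) ->
    Size = byte_size(Line),
    case Size > 0 andalso binary:last(Line) =:= $\n of
        true -> Size - 1;
        false -> Size
    end.
```

    byte_size/1 is a constant-time operation on a binary, so no line is ever traversed or copied; dividing the first element of the result by the second gives the mean.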

    It won't affect performance (significantly), but the multiplication by 1.0 in Total/(1.0*Sequences) is only necessary in languages with broken division. Erlang division works correctly: the / operator always returns a float, even when both operands are integers.
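
    To illustrate, here is a minimal sketch of the final step; mean_demo and mean/2 are made-up names, and the guard simply rules out a file with no sequences.

```erlang
-module(mean_demo).
-export([mean/2]).

%% Mean of Total over Sequences. The / operator yields a float even
%% for two integers, so no multiplication by 1.0 is needed.
mean(Total, Sequences) when is_integer(Sequences), Sequences > 0 ->
    Total / Sequences.
```

    mean_demo:mean(35, 2) gives 17.5. Truncating integer division is a separate operator, div, so 35 div 2 is 17.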
