“average length of the sequences in a fasta file”: Can you improve this Erlang code?

无人共我 2021-02-06 12:29

I'm trying to get the mean length of the sequences in a FASTA file using Erlang. A FASTA file looks like this:

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
AT         


        
5 answers
  •  無奈伤痛
    2021-02-06 13:09

    I too am learning Erlang, thanks for the fun question.

    As I understand it, working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead, you should see some performance gains. I don't know how you would handle arbitrary-length strings with binaries, but if you can sort that out, it should help.

    Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things along by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)

    I'd expect the read_ahead to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".
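As a rough illustration of the "benchmark before guessing" advice, a variant can be wrapped in timer:tc/1, which reports elapsed wall-clock time in microseconds. This is a minimal sketch; the module name `bench` and the function `time/1` are hypothetical helpers, not part of the original answer.

```erlang
-module(bench).
-export([time/1]).

%% Hypothetical helper: run a zero-argument fun once and
%% return {ElapsedSeconds, Result}.
time(Fun) when is_function(Fun, 0) ->
    %% timer:tc/1 returns {Microseconds, Value}.
    {Micros, Result} = timer:tc(Fun),
    {Micros / 1000000, Result}.
```

From the shell, something like `bench:time(fun() -> lists:sum(lists:seq(1, 1000000)) end).` times a single run; a single measurement is noisy, so several runs per variant give a fairer comparison.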

    EDIT

    Using file:open("uniprot_sprot.fasta", [read, read_ahead]) gets 1m31s on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)

    Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.

    Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.

    Here's where I stand now; if you can figure out how to use binaries instead of lists, it might see still more improvement:

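    -module(golf).
    -export([test/0]).

    %% Fold one line into the {Sequences, Total} accumulator.
    line([], Acc) -> Acc;
    %% A header line (starts with ">") begins a new sequence.
    line(">" ++ _Rest, {Sequences, Total}) -> {Sequences + 1, Total};
    %% Any other line is sequence data. string:trim/1 (OTP 20+) also drops
    %% the trailing newline; string:strip/1 only removes spaces.
    line(L, {Sequences, Total}) -> {Sequences, Total + length(string:trim(L))}.

    scanLines(S, Sequences, Total) ->
        case file:read_line(S) of
            eof -> {Sequences, Total};
            {error, _} -> {Sequences, Total};
            {ok, Line} ->
                {S2, T2} = line(Line, {Sequences, Total}),
                scanLines(S, S2, T2)
        end.

    test() ->
        case file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]) of
            {ok, File} ->
                {Sequences, Total} = scanLines(File, 0, 0),
                ok = file:close(File),
                %% "/" always yields a float in Erlang.
                io:format("~p~n", [Total / Sequences]);
            {error, Reason} ->
                %% Reason is an atom such as enoent, so print it with ~p.
                io:format("~p~n", [Reason])
        end,
        halt().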
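One way the binary suggestion might look is sketched below. It assumes the file is opened with the `binary` option so that file:read_line/1 returns binaries instead of character lists. The module name `fasta_mean` and its functions are hypothetical, and the sketch has not been benchmarked against the list version. It also counts every non-whitespace byte on a sequence line, which is a slightly different rule from trimming only the ends.

```erlang
-module(fasta_mean).
-export([mean_length/1, line/2]).

%% Hypothetical sketch: mean sequence length of the FASTA file at Path.
%% Crashes (let it fail) if the file cannot be opened or has no sequences.
mean_length(Path) ->
    %% binary: read_line/1 returns binaries; read_ahead: buffered IO.
    {ok, Fd} = file:open(Path, [read, raw, binary, read_ahead]),
    {Seqs, Total} = scan(Fd, {0, 0}),
    ok = file:close(Fd),
    Total / Seqs.

%% Read lines until eof, folding each one into the accumulator.
scan(Fd, Acc) ->
    case file:read_line(Fd) of
        {ok, Line} -> scan(Fd, line(Line, Acc));
        eof -> Acc
    end.

%% A header line starts with ">" and begins a new sequence.
line(<<">", _/binary>>, {Seqs, Total}) ->
    {Seqs + 1, Total};
%% Any other line contributes its non-whitespace bytes to the total.
line(Line, {Seqs, Total}) ->
    {Seqs, Total + residues(Line, 0)}.

%% Count bytes, skipping space, tab, CR and LF.
residues(<<C, Rest/binary>>, N)
  when C =:= $\s; C =:= $\t; C =:= $\r; C =:= $\n ->
    residues(Rest, N);
residues(<<_, Rest/binary>>, N) ->
    residues(Rest, N + 1);
residues(<<>>, N) ->
    N.
```

Matching on the binary directly avoids building a list per line, which is the overhead the answer is worried about; whether it actually helps on the full dataset is something to measure rather than assume.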
    
