I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this:
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
AT
I too am learning Erlang, thanks for the fun question.
I understand working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.
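As a rough sketch of what the binary approach could look like: a binary can hold data of any length, header lines can be matched with a binary pattern, and byte_size/1 gives a line's length without walking a list. The module name fasta_bin and the function mean_length/1 are made up for illustration, and the sketch assumes the whole input is already in memory with Unix newlines.

```erlang
-module(fasta_bin).
-export([mean_length/1]).

%% Mean sequence length of FASTA data held in one binary.
%% Assumes "\n" line endings and at least one ">" header line
%% (with no header the division below fails with badarith).
mean_length(Data) when is_binary(Data) ->
    Lines = binary:split(Data, <<"\n">>, [global]),
    {Seqs, Total} = lists:foldl(fun count/2, {0, 0}, Lines),
    Total / Seqs.

%% A header line starts a new sequence; any other line adds its bytes.
count(<<">", _/binary>>, {Seqs, Total}) -> {Seqs + 1, Total};
count(Line, {Seqs, Total}) -> {Seqs, Total + byte_size(Line)}.
```

For a file that fits in memory, the input binary could come from file:read_file/1; for something the size of uniprot_sprot.fasta, reading line by line is probably the safer choice.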
Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things along by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)
I'd expect the read_ahead to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".
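One way to do that benchmarking, as a minimal sketch: time a full pass over the file with timer:tc/3 for each combination of open modes. The module open_bench and its functions are hypothetical names, and the pass only counts lines, so the numbers would show IO cost rather than the cost of the length calculation.

```erlang
-module(open_bench).
-export([time_modes/1, count_lines/2]).

%% Time one full pass over Path for each set of open modes.
%% Returns [{Modes, Microseconds, LineCount}].
time_modes(Path) ->
    ModeSets = [[read], [read, read_ahead], [read, read_ahead, raw]],
    [begin
         {Micros, Lines} = timer:tc(?MODULE, count_lines, [Path, Modes]),
         {Modes, Micros, Lines}
     end || Modes <- ModeSets].

%% Exported so that timer:tc/3 can call it by module and function name.
count_lines(Path, Modes) ->
    {ok, Fd} = file:open(Path, Modes),
    try loop(Fd, 0) after file:close(Fd) end.

%% Count lines until eof; a read error crashes with case_clause.
loop(Fd, N) ->
    case file:read_line(Fd) of
        {ok, _Line} -> loop(Fd, N + 1);
        eof -> N
    end.
```

Calling open_bench:time_modes/1 with the path to the dataset returns one tuple per mode set, which makes the modes easy to compare on the same machine.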
EDIT
Using file:open("uniprot_sprot.fasta", [read, read_ahead]) gets 1m31s on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)

Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.

Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.
Here's where I stand now; if you can figure out how to use binaries instead of lists, you might see still more improvement:
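-module(golf).
-export([test/0]).

%% Fold one line into the {Sequences, Total} accumulator.
line([], {Sequences, Total}) -> {Sequences, Total};
%% A header line starts a new sequence.
line(">" ++ _Rest, {Sequences, Total}) -> {Sequences + 1, Total};
%% string:strip/1 removes only spaces, so a trailing newline is still counted.
line(L, {Sequences, Total}) -> {Sequences, Total + string:len(string:strip(L))}.

%% Read the file line by line; stop at eof or on a read error.
scanLines(S, Sequences, Total) ->
    case file:read_line(S) of
        eof -> {Sequences, Total};
        {error, _} -> {Sequences, Total};
        {ok, Line} ->
            {S2, T2} = line(Line, {Sequences, Total}),
            scanLines(S, S2, T2)
    end.

test() ->
    F = file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]),
    case F of
        {ok, File} ->
            {Sequences, Total} = scanLines(File, 0, 0),
            io:format("~p\n", [Total / (1.0 * Sequences)]);
        {error, Reason} ->
            %% Reason is an atom such as enoent; io:format/2 needs an argument list.
            io:format("~p~n", [Reason])
    end,
    halt().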
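
For the binaries question, here is a hedged sketch of how the same loop might look when the file is opened with the binary option, so that file:read_line/1 returns binaries instead of lists. The module golf_bin and the Path argument are hypothetical, and I have not timed it against the list version. Unlike string:strip/1, which removes only spaces, it drops the trailing newline before counting, so its result can differ slightly from a version that counts it.

```erlang
-module(golf_bin).
-export([mean/1]).

%% Mean sequence length for the FASTA file at Path.
%% Assumes the file has at least one ">" header line.
mean(Path) ->
    {ok, File} = file:open(Path, [read, raw, binary, read_ahead]),
    try
        {Sequences, Total} = scan(File, 0, 0),
        Total / Sequences
    after
        file:close(File)
    end.

%% Header lines bump the sequence count; other lines add their length.
scan(File, Sequences, Total) ->
    case file:read_line(File) of
        {ok, <<">", _/binary>>} -> scan(File, Sequences + 1, Total);
        {ok, Line} -> scan(File, Sequences, Total + seq_bytes(Line));
        eof -> {Sequences, Total}
    end.

%% Byte count of a line up to (not including) its newline, if any.
seq_bytes(Line) ->
    [Seq | _] = binary:split(Line, <<"\n">>),
    byte_size(Seq).
```

Since byte_size/1 does not traverse the data the way string:len/1 walks a list, this is where I would expect any gain to come from, but that is a guess until it is benchmarked.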