Improving clojure lazy-seq usage for iterative text parsing

I'm writing a Clojure implementation of this coding challenge, attempting to find the average length of sequence records in Fasta format:

>1
GATCGA
GTC
>2
GCA
>3
AAAAA

For more background see this related StackOverflow post about an Erlang solution.

My beginner Clojure attempt uses lazy-seq to attempt to read in the file one record at a time so it will scale to large files. However it is fairly memory hungry and slow, so I suspect that it's not implemented optimally. Here is a solution using the BioJava library to abstract out the parsing of the records:

(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths [seq-iter]
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  (lazy-seq
    (if (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths [in-file seq-type]
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))

and an equivalent approach without external libraries:

(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths [lines cur-length]
  "Retrieve lengths of sequences in the file using line lengths"
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (if (= nil cur-line) [cur-length]
        (if (= \> (first cur-line))
          (cons cur-length (seq-lengths remain-lines 0))
          (seq-lengths remain-lines (+ cur-length (.length cur-line))))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ; pop off first item since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))

The current implementation takes 44 seconds on a large file compared to 7 seconds for a Python implementation. Can you offer any suggestions on speeding the code up and making it more intuitive? Is the usage of lazy-seq correctly parsing the file record by record as intended?

Alex Taggart

It probably doesn't matter, but average is holding onto the head of the seq of lengths.
The following is a wholly untested, but lazier way to do what I think you want.

(use 'clojure.java.io) ;' since 1.2

(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt)))

(defn fasta-avg [f]
  (->> (reader f) 
    line-seq
    (filter #(not (.startsWith % ">")))
    (map #(.length %))
    lazy-avg))

Your average function is non-lazy -- it needs to realise the entire coll argument while holding onto its head. Update: Just realised that my original answer included a nonsensical suggestion as to how to solve the above problem... argh. Fortunately ataggart has since posted a correct solution.

Other than that, your code does seem lazy at first glance, though the use of read-lines is currently discouraged (use line-seq instead).

If the file is really large and your functions will be called a large number of times, type-hinting seq-iter in the argument vector of seq-length -- ^NameOfBiojavaSeqIterClass seq-iter, use #^ in place of ^ if you're on Clojure 1.1 -- might make a significant difference. In fact, (set! *warn-on-reflection* true), then compile your code and add type hints to remove all reflection warnings.

来源：https://stackoverflow.com/questions/3303848/improving-clojure-lazy-seq-usage-for-iterative-text-parsing

标签

clojure

lazy-evaluation

bioinformatics