Using clojure-csv.core to parse a huge CSV file


Question


So far I have :

(:require [clojure-csv.core :as csv]
          [clojure.java.io :as io])

(def csv-file (.getFile  (clojure.java.io/resource "verbs.csv")))

(defn process-csv [file]
  (with-open  [rdr  (io/reader file)]
    (csv/parse-csv rdr)))

But I am getting java.io.IOException: Stream closed. I am using clojure-csv, which exposes two functions; the one I am using, parse-csv, is documented as:

Takes a CSV as a char sequence or string, and returns a lazy sequence of vectors of strings

What I think I know: with-open is lazy, and the rdr in (csv/parse-csv rdr) is a single line of the CSV file, right?

PS. I would also like to search the file. Is it expensive to repeatedly open the file (even though it is read lazily) and search the entire contents?


Answer 1:


with-open is not lazy, but if you do something lazy inside it, you run into a problem whenever the lazy result is not forced within the scope of the with-open. The fix is to force all the lazy results before exiting the with-open block.

(defn process-csv [file]
  (with-open [rdr (io/reader file)]
    (doall (csv/parse-csv rdr))))

The function doall exists precisely to ensure that a lazy sequence is fully realized.
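
For illustration, here is a hedged usage sketch (reusing the csv-file var defined in the question); because doall realizes everything before with-open closes the reader, the result stays usable afterwards:

;; The whole file is realized in memory while the reader is still open,
;; so the returned data can be used freely after the reader is closed.
(def rows (process-csv csv-file))

(count rows)   ;; total number of records
(first rows)   ;; first record, as a vector of strings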

Due to the size of your input, another possibility would be to arrange to close the reader yourself, and then use laziness for its intended purpose (only generating results as you need them).

(defn find-results
  [stream]
  (for [record stream
        :while (seq (first record))]
    record))

(def rdr (io/reader "verbs.csv"))
(def csv (csv/parse-csv rdr))

(def results (doall (find-results csv)))

(.close rdr)
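
To address the PS about searching: while the reader is still open (i.e. before the (.close rdr) call above), a lookup can stay lazy and only parse as much of the file as needed. A hedged sketch follows; the column layout of verbs.csv is an assumption here, so adjust the predicate to your data:

;; Scan the lazy sequence for the first row whose first column matches.
(defn find-verb [csv-seq verb]
  (first (filter #(= verb (first %)) csv-seq)))

(find-verb csv "aller")   ;; returns the matching row, e.g. ["aller" "to go"] -- hypothetical data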



Answer 2:


I know this is already answered, but here is a solution similar to @noisesmith's: build an explicit lazy sequence that closes the reader automatically once you reach the end of the input.

If you process the entire file lazily, you do not have to manage the handle yourself; if you stop before the end, though, the reader is never closed and you are back to open-handle issues.

(defn lazy-read-csv
  [csv-file]
  (let [in-file (io/reader csv-file)
        csv-seq (csv/parse-csv in-file)   ; parse-csv, matching the clojure-csv alias above
        lazy (fn lazy [wrapped]
               (lazy-seq
                 (if-let [s (seq wrapped)]
                   (cons (first s) (lazy (rest s)))
                   (.close in-file))))]   ; close the reader once the input is exhausted
    (lazy csv-seq)))

This is from the excellent Clojure Data Analysis Cookbook by Eric Rochester.
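
A brief usage sketch under the same assumptions; note that the reader only closes itself once the sequence has been consumed to the end, so stopping early still leaves it open:

;; Take a few rows without realizing the whole file (the reader stays open here).
(take 5 (lazy-read-csv csv-file))

;; Consuming the whole sequence reaches the end of the input,
;; at which point the reader is closed automatically.
(count (lazy-read-csv csv-file))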




Answer 3:


The problem is that your process-csv function does not really "process" the CSV data inside the with-open scope, but returns it as a lazy sequence. By the time execution exits the with-open scope, the stream has already been closed, so trying to traverse the lazy sequence later throws the exception.

Unless you are confident that the whole CSV file can be read and parsed into memory at once, I would suggest not following what is recommended in the other answers, i.e. forcing evaluation of the lazy sequence inside the with-open scope using doall.

Instead, if you prefer to separate the resource allocation and de-allocation part from the "more reusable" business logic, you should do something like this:

(defn process-csv [rdr conn]
  (doseq [row (csv/parse-csv rdr)
          :when (wanted? row)]
    (save-to-custom-database-table conn row)))

(defn start [process-fn]
  (let [csv-file (.getFile (clojure.java.io/resource "verbs.csv"))]
    (with-open [rdr (io/reader csv-file)
                conn (database-connection "TEST")]
      (process-fn rdr conn))))

(start process-csv)

As you can see, the process-csv function handles the reader and database resources in an "abstract" manner, i.e. it is not concerned with the fact that these resources are Closeable and must be closed after use. Instead, closing the resources is handled by the start function as a separate matter.

I would also recommend looking into Clojure protocols and how they can be used to abstract resources in scenarios like the one above.
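
For example, a minimal hypothetical sketch of that idea (the protocol and record names are invented here; the point is that the processing code depends on a rows abstraction rather than on Closeable readers):

;; Hypothetical protocol: processing code only asks for rows.
(defprotocol RowSource
  (rows [this] "Return a (possibly lazy) sequence of parsed CSV rows."))

;; One concrete source backed by a reader; closing remains a separate concern.
(defrecord CsvFileSource [rdr]
  RowSource
  (rows [_] (csv/parse-csv rdr))
  java.io.Closeable
  (close [_] (.close rdr)))

;; Business logic written against the protocol, not the reader.
(defn process-rows [source]
  (doseq [row (rows source)
          :when (wanted? row)]
    (println row)))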




Answer 4:


It looks like the file is being parsed lazily outside the with-open form, after the reader has already been closed.

Try something like this to verify; it prints the first 5 parsed lines:

(defn process-csv [file]
  (with-open [rdr (io/reader file)]
    (let [lines (csv/parse-csv rdr)]
      (doseq [l (take 5 lines)]
        (println l)))))

If the file is huge, I don't think opening it repeatedly would be expensive compared to the cost of searching its contents.

If you need to search many times, though, I would consider building some kind of search index, as sketched below.
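
For example, a hedged sketch of such an index (this assumes the first column of verbs.csv is a unique key, and it reads the whole file into memory once):

;; Read the file once and key each row by its first column, so repeated
;; lookups become a map access instead of a re-scan of the file.
(defn build-index [file]
  (with-open [rdr (io/reader file)]
    (into {} (map (juxt first identity)) (csv/parse-csv rdr))))

(def index (build-index csv-file))
(get index "aller")   ;; hypothetical key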



Source: https://stackoverflow.com/questions/19645160/using-clojure-csv-core-to-parse-a-huge-csv-file
