Partitioning in Clojure with a lazy collection of strings

Submitted by 与世无争的帅哥 on 2019-12-13 13:14:06

Question


Starting with a collection of strings like:

(def str-coll ["abcd" "efgh" "jklm"])

The goal is to take a fixed number of characters at a time from the head of the string collection, producing a partitioned grouping of characters that runs across string boundaries. This is the desired behavior:

;; clojure.string/join takes the same arguments as the deprecated clojure.contrib.str-utils2/join
(use '[clojure.string :only (join)])
(partition-all 3 (join "" str-coll))

((\a \b \c) (\d \e \f) (\g \h \j) (\k \l \m))

However, using join forces evaluation of the entire collection, which causes memory issues when dealing with very large collections of strings. My specific use case is generating subsets of strings from a lazy collection generated by parsing a large file of delimited records:

(use '[clojure.java.io :only (reader)]) ; assumed source of `reader`
(defn file-coll [in-file]
  (->> (line-seq (reader in-file))
    (partition-by #(.startsWith ^String % ">"))
    (partition 2)))

and is building on work from this previous question. I've tried combinations of reduce, partition and join but can't come up with the right incantation to pull characters from the head of the first string and lazily evaluate subsequent strings as needed. Thanks much for any ideas or pointers.
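For readers unfamiliar with the input, here is a minimal sketch of what a function shaped like file-coll yields. The sample data and the name records are hypothetical, and the format (a ">" header line followed by one or more content lines) is only inferred from the startsWith test above.

```clojure
(require '[clojure.java.io :as io])

;; Same shape as file-coll above, but accepting any java.io.Reader source.
(defn records [source]
  (->> (line-seq (io/reader source))
       (partition-by #(.startsWith ^String % ">"))
       (partition 2)))

;; Hypothetical sample data: two records, each a ">" header plus content lines.
(def sample ">id1\nabcd\nefgh\n>id2\njklm\n")

(records (java.io.StringReader. sample))
;; => (((">id1") ("abcd" "efgh")) ((">id2") ("jklm")))
```

Each record is a pair of a header group and a group of content strings; the second group is the lazy collection of strings that needs to be partitioned by character count.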


Answer 1:


Not quite sure what you're going for, but the following does what your first example does, and does so lazily.

Step-by-step for clarity:

user=> (def str-coll ["abcd" "efgh" "jklm"])
#'user/str-coll
user=> (map seq str-coll)
((\a \b \c \d) (\e \f \g \h) (\j \k \l \m))
user=> (flatten *1)
(\a \b \c \d \e \f \g \h \j \k \l \m)
user=> (partition 3 *1)
((\a \b \c) (\d \e \f) (\g \h \j) (\k \l \m))

All together now:

(->> str-coll 
  (map seq)
  flatten
  (partition 3))
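One way to check the laziness claim is to feed the pipeline an endless input and take only a few groups. The sketch below wraps the pipeline in a hypothetical helper, char-groups, and swaps in partition-all (as in the question) so that a short final group is kept rather than dropped.

```clojure
;; Hypothetical helper wrapping the pipeline above.
(defn char-groups [n strs]
  (->> strs
       (map seq)
       flatten
       (partition-all n)))

(char-groups 3 ["abcd" "efgh" "jklm"])
;; => ((\a \b \c) (\d \e \f) (\g \h \j) (\k \l \m))

;; Returns even though the input never ends, so the input
;; is not being realized in full.
(take 2 (char-groups 3 (cycle ["abcd" "efgh"])))
;; => ((\a \b \c) (\d \e \f))
```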



Answer 2:


EDIT: EVERYTHING I'VE WRITTEN WAS WRONG

When a function with a var-arg is applied to a seq longer than the number of discrete args, the remainder of the seq is passed as the var-arg (see RestFn.applyTo).
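A rough way to observe this is to count how many elements of an endless lazy seq are realized when it is applied to concat. The names realized and counting-seq are hypothetical, and the exact count depends on implementation details, so treat this as a sketch rather than a specification.

```clojure
;; Hypothetical helper: counts how many elements of an endless
;; lazy seq have been realized so far.
(def realized (atom 0))

(defn counting-seq []
  (lazy-seq
    (swap! realized inc)
    (cons [1 2] (counting-seq))))

;; The variadic arity of concat has two fixed parameters plus a rest
;; parameter, so apply only walks far enough to fill the fixed ones;
;; the remainder is handed over as an unrealized seq.
(def result (apply concat (counting-seq)))

(take 5 result)
;; => (1 2 1 2 1)

;; @realized stays small; the exact number is an implementation detail.
```

The call to apply returns at all only because the rest of the seq is passed through untouched; if apply had to realize every argument, it would never finish on an endless input.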

To Jürgen: I'm stupid. You're smart. I was wrong. You were right. You're the best. I'm the worst. You're very good-looking. I'm not attractive.

The following is a record of my idiocy...


Responding to Jürgen Hötzel's comment.

mapcat isn't fully lazy because apply isn't lazy in evaluating the number of args to apply. Further, apply can't be lazy because functions must be invoked with a discrete number of args. Currently if the number of args exceeds 20, the remaining args are dumped into an array, hence non-lazy.

So looking at the source for mapcat:

(defn mapcat
  "Returns the result of applying concat to the result of applying map
  to f and colls.  Thus function f should return a collection."
  {:added "1.0"}
  [f & colls]
    (apply concat (apply map f colls)))

If we expand the evaluation out using the example, the inner apply would evaluate to:

user=> (map seq str-coll)
((\a \b \c \d) (\e \f \g \h) (\j \k \l \m))

which is fine since the str-coll doesn't get fully realized, but then the outer apply would evaluate to:

user=> (concat '(\a \b \c \d) '(\e \f \g \h) '(\j \k \l \m))
(\a \b \c \d \e \f \g \h \j \k \l \m)

Note that the outer apply applies n arguments to concat, one for each string in the original str-coll. Now, it's true that the result of concat is lazy, and each arg is itself lazy, but you still need to realize the full length of str-coll to get those n lazy seqs. If str-coll has 1000 strings, then concat will get 1000 args, and all 1000 strings would need to be read out of the file and into memory before concat could be called.


For the unbelievers, a demonstration of the seq-realizing behavior of apply:

user=> (defn loud-seq [] (lazy-seq (println "HELLO") (cons 1 (loud-seq))))
#'user/loud-seq
user=> (take 3 (loud-seq)) ; displaying the lazy-seq realizes it, thus printing HELLO
(HELLO
HELLO
1 HELLO
1 1)
user=> (do (take 3 (loud-seq)) nil) ; lazy-seq not realized; no printing of HELLO
nil
user=> (do (apply concat (take 3 (loud-seq))) nil) ; draw your own conclusions
HELLO
HELLO
HELLO
nil

And a demonstration that varargs are not lazy:

user=> (defn foo [& more] (type more))
#'user/foo
user=> (foo 1 2 3 4)
clojure.lang.ArraySeq
user=> (apply foo (repeat 4 1))
clojure.lang.Cons

Though as counterpoint, that the following works baffles me:

user=> (take 10 (apply concat (repeat [1 2 3 4])))
(1 2 3 4 1 2 3 4 1 2)
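The counterpoint is consistent with the correction at the top of this answer: the tail of the seq reaches concat as an unrealized rest argument. Under that reading, a mapcat-based version of the partitioning can be sketched as follows; char-groups-mapcat is a hypothetical name, and this is not necessarily the exact form proposed in the comment being answered.

```clojure
;; Hypothetical helper: mapcat is (apply concat (map f coll)) underneath,
;; so it stays lazy for the same reason as the counterpoint above.
(defn char-groups-mapcat [n strs]
  (partition-all n (mapcat seq strs)))

(char-groups-mapcat 3 ["abcd" "efgh" "jklm"])
;; => ((\a \b \c) (\d \e \f) (\g \h \j) (\k \l \m))

;; Works on an endless collection of strings as well.
(take 2 (char-groups-mapcat 3 (repeat "abcd")))
;; => ((\a \b \c) (\d \a \b))
```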


Source: https://stackoverflow.com/questions/3348719/partitioning-in-clojure-with-a-lazy-collection-of-strings
