Clojure - string-concat with group-by in sequences of maps

╄→尐↘猪︶ㄣ submitted on 2019-12-23 04:42:03

Question


Given input data from a JDBC source such as this:

  (def input-data
    [{:doc_id 1 :doc_seq 1  :doc_content "this is a very long "}
    {:doc_id 1 :doc_seq 2  :doc_content "sentence from a mainframe "}
    {:doc_id 1 :doc_seq 3  :doc_content "system that was built before i was "}
    {:doc_id 1 :doc_seq 4  :doc_content "born."}
    {:doc_id 2 :doc_seq 1  :doc_content "this is a another very long "}
    {:doc_id 2 :doc_seq 2  :doc_content "sentence from the same mainframe "}
    {:doc_id 3 :doc_seq 1  :doc_content "Ok here we are again. "}
    {:doc_id 3 :doc_seq 2  :doc_content "The mainframe only had 40 char per field so"}
    {:doc_id 3 :doc_seq 3  :doc_content "they broke it into multiple rows "}
    {:doc_id 3 :doc_seq 4  :doc_content "which seems to be common"}
    {:doc_id 3 :doc_seq 5  :doc_content " for the time. "}
    {:doc_id 3 :doc_seq 6  :doc_content "thanks for your help."}])

I want to group by doc_id and string-concat the doc_content, so that my output looks like this:

  [{:doc_id 1 :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
   {:doc_id 2 :doc_content "this is a another very long sentence ... clip..."}
   {:doc_id 3 :doc_content "... clip..."}]

I was thinking of using group-by, but that outputs a map, and I need to output something lazy because the input data set could be very large. Maybe I could run group-by and some combination of reduce-kv to get what I'm looking for, or maybe something with frequencies if I can coerce it to be lazy.
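
To make the concern concrete, here is a minimal sketch showing that group-by returns a map keyed by :doc_id, which it can only build after consuming the whole input. The name sample-rows and its data are made up for illustration and do not come from the JDBC source.

```clojure
;; Hypothetical sample data, not taken from the real JDBC source.
(def sample-rows
  [{:doc_id 1 :doc_seq 1 :doc_content "a "}
   {:doc_id 1 :doc_seq 2 :doc_content "b"}
   {:doc_id 2 :doc_seq 1 :doc_content "c"}])

;; group-by walks the entire input and returns a map
;; from each :doc_id to a vector of its rows.
(def grouped (group-by :doc_id sample-rows))
```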

I can guarantee that the input will be sorted: I will put an ORDER BY (in SQL) on doc_id and doc_seq, so the only thing this program is responsible for is the aggregate/string-concat part. The whole input sequence will likely be large, but a specific doc_id in that sequence should only have a few dozen doc_seq rows.

Any tips appreciated.


Answer 1:


partition-by is lazy, and as long as each doc sequence fits in memory, this should work:

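(defn collapse-docs
  "Merges the rows of one document: string values are concatenated,
  any other value is replaced by the later row's value."
  [docs]
  (apply merge-with
         (fn [l r]
           (if (string? r)
             (str l r)   ; e.g. :doc_content
             r))         ; e.g. :doc_id, :doc_seq
         docs))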

(sequence ;; you may want to use eduction here, depending on use case
  (comp
    (partition-by :doc_id)
    (map collapse-docs))
  input-data)
=>
({:doc_id 1,
  :doc_seq 4,
  :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
 {:doc_id 2, :doc_seq 2, :doc_content "this is a another very long sentence from the same mainframe "}
 {:doc_id 3,
  :doc_seq 6,
  :doc_content "Ok here we are again. The mainframe only had 40 char per field sothey broke it into multiple rows which seems to be common for the time. thanks for your help."})
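
The result above still carries the last :doc_seq of each document, whereas the question asks for only :doc_id and :doc_content. A possible variant, sketched under the same assumption that the rows arrive sorted by doc_id and doc_seq, builds each output map explicitly. The names collapse-doc, collapse-xf, short-rows and collapsed are made up for this example.

```clojure
(defn collapse-doc
  "Collapses one partition of rows sharing a :doc_id into a single map."
  [rows]
  {:doc_id      (:doc_id (first rows))
   :doc_content (apply str (map :doc_content rows))})

;; Transducer: split the sorted rows at each change of :doc_id,
;; then collapse every partition.
(def collapse-xf
  (comp (partition-by :doc_id)
        (map collapse-doc)))

;; Hypothetical, shortened input for illustration.
(def short-rows
  [{:doc_id 1 :doc_seq 1 :doc_content "a "}
   {:doc_id 1 :doc_seq 2 :doc_content "b"}
   {:doc_id 2 :doc_seq 1 :doc_content "c"}])

(def collapsed (sequence collapse-xf short-rows))
;; collapsed => ({:doc_id 1, :doc_content "a b"} {:doc_id 2, :doc_content "c"})
```

Only the rows of the document currently being assembled need to be held in memory at once, which matches the constraint that each doc_id has just a few dozen rows.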


Source: https://stackoverflow.com/questions/49097325/clojure-strng-concat-with-group-by-in-sequences-of-maps
