Question
Given input data from a JDBC source such as this:
(def input-data
  [{:doc_id 1 :doc_seq 1 :doc_content "this is a very long "}
   {:doc_id 1 :doc_seq 2 :doc_content "sentence from a mainframe "}
   {:doc_id 1 :doc_seq 3 :doc_content "system that was built before i was "}
   {:doc_id 1 :doc_seq 4 :doc_content "born."}
   {:doc_id 2 :doc_seq 1 :doc_content "this is a another very long "}
   {:doc_id 2 :doc_seq 2 :doc_content "sentence from the same mainframe "}
   {:doc_id 3 :doc_seq 1 :doc_content "Ok here we are again. "}
   {:doc_id 3 :doc_seq 2 :doc_content "The mainframe only had 40 char per field so"}
   {:doc_id 3 :doc_seq 3 :doc_content "they broke it into multiple rows "}
   {:doc_id 3 :doc_seq 4 :doc_content "which seems to be common"}
   {:doc_id 3 :doc_seq 5 :doc_content " for the time. "}
   {:doc_id 3 :doc_seq 6 :doc_content "thanks for your help."}])
I want to group by doc_id and string-concat the doc_content, so my output would look like this:
[{:doc_id 1 :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
 {:doc_id 2 :doc_content "this is a another very long sentence ... clip..."}
 {:doc_id 3 :doc_content "... clip..."}]
I was thinking of using group-by; however, that outputs a map, and I need to output something lazy, as the input data set could be very large. Maybe I could run group-by and some combination of reduce-kv to get what I'm looking for... or maybe something with frequencies if I can coerce it to be lazy.
I can guarantee that it will be sorted; I will put the order by (through SQL) on doc_id and doc_seq, so the only thing this program is responsible for is the aggregate/string-concat part. I will likely have large input data for the whole sequence, but a specific doc_id in that sequence should only have a few dozen doc_seq values.
Any tips appreciated.
Answer 1:
partition-by is lazy, and as long as each doc sequence fits in memory, this should work:
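;; Merge all rows of one doc into a single map: string values are
;; concatenated; other values (:doc_id, :doc_seq) keep the last row's value.
(defn collapse-docs [docs]
  (apply merge-with
         (fn [l r]
           (if (string? r)
             (str l r)
             r))
         docs))

(sequence ;; you may want to use eduction here, depending on use case
  (comp
    (partition-by :doc_id)
    (map collapse-docs))
  input-data)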
=>
({:doc_id 1,
  :doc_seq 4,
  :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
 {:doc_id 2, :doc_seq 2, :doc_content "this is a another very long sentence from the same mainframe "}
 {:doc_id 3,
  :doc_seq 6,
  :doc_content "Ok here we are again. The mainframe only had 40 char per field sothey broke it into multiple rows which seems to be common for the time. thanks for your help."})
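The result above still carries the last :doc_seq of each document, whereas the desired output in the question has only :doc_id and :doc_content. A minimal sketch of one way to get that shape, assuming the rows arrive sorted by doc_id and doc_seq as the question states; sample-rows, collapse-partition and xf are hypothetical names used only for illustration, not part of the original answer:

```clojure
(require '[clojure.string :as str])

;; Small sample in the same shape as input-data,
;; already sorted by :doc_id and :doc_seq.
(def sample-rows
  [{:doc_id 1 :doc_seq 1 :doc_content "this is a very long "}
   {:doc_id 1 :doc_seq 2 :doc_content "sentence."}
   {:doc_id 2 :doc_seq 1 :doc_content "another one."}])

;; Hypothetical helper: collapse one partition (all rows share a :doc_id)
;; into a map holding only :doc_id and the joined :doc_content.
(defn collapse-partition [rows]
  {:doc_id      (:doc_id (first rows))
   :doc_content (str/join (map :doc_content rows))})

;; Transducer: group consecutive rows by :doc_id, then collapse each group.
(def xf
  (comp (partition-by :doc_id)
        (map collapse-partition)))

(sequence xf sample-rows)
;; => ({:doc_id 1, :doc_content "this is a very long sentence."}
;;     {:doc_id 2, :doc_content "another one."})
```

Only one partition needs to be held in memory at a time, which matches the "few dozen doc_seq per doc_id" constraint. If the result is consumed once, for example (run! println (eduction xf sample-rows)), eduction applies the same xf without caching a sequence.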
Source: https://stackoverflow.com/questions/49097325/clojure-strng-concat-with-group-by-in-sequences-of-maps