Convert from clojure.lang.LazySeq to type org.apache.spark.api.java.JavaRDD


Question


I developed a function in Clojure to fill in an empty column from the last non-empty value. I'm assuming this works, given

(:require [flambo.api :as f])

(defn replicate-val
  [rdd input]
  (let [{:keys [col]} input
        result (reductions (fn [a b]
                             (if (empty? (nth b col))
                               (assoc b col (nth a col))
                               b))
                           rdd)]
    (println "Result type is: " (type result))))

Got this:

;=> "Result type is:  clojure.lang.LazySeq"

The question is: how do I convert this back to type JavaRDD, using flambo (a Spark wrapper)?

I tried (f/map result #(.toJavaRDD %)) in the let form to attempt to convert it to a JavaRDD.

I got this error:

"No matching method found: map for class clojure.lang.LazySeq"

which is expected because result is of type clojure.lang.LazySeq.

The question is how I make this conversion, or how I can refactor the code to accommodate this.

Here is a sample input rdd:

(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"

But it looks like:

[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]

Required output is:

[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]

Thanks.


Answer 1:


First of all, RDDs are not iterable (they don't implement ISeq), so you cannot use reductions on them. Ignoring that, the whole idea of accessing the previous record is rather tricky. For one thing, you cannot directly access values from another partition. Moreover, only transformations which don't require shuffling preserve order.

The simplest approach here would be to use DataFrames and window functions with an explicit ordering, but as far as I know Flambo doesn't implement the required methods. It is always possible to use raw SQL or to access the Java/Scala API, but if you want to avoid this you can try the following pipeline.

First, let's create a broadcast variable holding the last value of each partition:

(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)

(def last-per-part
  (f/fn [it]
    (let [context (TaskContext/get)   ;; this task's context
          xs (iterator-seq it)]       ;; partition contents as a seq
      ;; one [partition-id, last row] pair per partition
      [[(.partitionId context) (last xs)]])))

(def last-vals-bd
  (bd/broadcast sc
    (into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
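For intuition, here is a local, plain-Clojure sketch of what ends up in the broadcast map, assuming purely for illustration that the sample rows are split into three partitions of two rows each (the real partition layout is up to Spark):

(def sample-partitions
  [[["04" "2" "3"] ["04" "" "5"]]
   [["5" "16" ""]  ["07" "" "36"]]
   [["07" "" "34"] ["07" "25" "34"]]])

(into {} (map-indexed (fn [i xs] [i (last xs)]) sample-partitions))
;; => {0 ["04" "" "5"], 1 ["07" "" "36"], 2 ["07" "25" "34"]}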

Next, some helpers for the actual job:

(defn fill-pair [col]
  ;; Given a [previous current] pair, fill current's column from previous when it is empty.
  (fn [x] (let [[a b] x] (if (empty? (nth b col)) (assoc b col (nth a col)) b))))
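As a quick local sanity check, fill-pair applied to a [previous current] pair behaves like the reduction step from the question (using column 1, as in the sample data):

((fill-pair 1) [["04" "2" "3"] ["04" "" "5"]])
;; => ["04" "2" "5"]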

(def fill-pairs
  (f/fn [it]
    (let [part-id (.partitionId (TaskContext/get)) ;; get the partition ID
          xs (iterator-seq it)                     ;; convert input to a seq
          prev (if (zero? part-id)                 ;; find the previous element:
                 (first xs)                        ;; nothing before the first partition
                 ((bd/value last-vals-bd) (dec part-id))) ;; last row of the previous partition
          ;; create a seq of (prev, current) pairs
          pairs (partition 2 1 (cons prev xs))
          ;; same as before -- `input` is assumed to be in scope, as in the question
          {:keys [col]} input
          ;; prepare the mapping function
          mapper (fill-pair col)]
      (map mapper pairs))))

Finally, you can pass fill-pairs to map-partitions:

(-> rdd (f/map-partitions fill-pairs) (f/collect))

A hidden assumption here is that the order of the partitions follows the order of the values. It may or may not hold in the general case, but without an explicit ordering it is probably the best you can get.

An alternative approach is to zipWithIndex, swap the order of the values, and perform a join with an offset.

(require '[flambo.tuple :as tp])

(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %))) ;; (index, row) pairs

(def rdd-idx-offset
  (f/map-to-pair rdd-idx
    (fn [t] (let [p (f/untuple t)]
              ;; shift every key down by one, so key i-1 carries the row at index i
              (tp/tuple (dec' (first p)) (second p))))))
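To see what the two keyed RDDs contain, here is a local, plain-Clojure sketch of the same indexing on the sample rows (the actual RDD elements are scala.Tuple2 pairs rather than vectors):

(def rows [["04" "2" "3"] ["04" "" "5"] ["5" "16" ""]
           ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]])

(def rows-idx (map-indexed vector rows))                       ;; ([0 row0] [1 row1] ...)
(def rows-idx-offset (map (fn [[i r]] [(dec i) r]) rows-idx))  ;; ([-1 row0] [0 row1] ...)
;; joining the two on the key therefore pairs every row with the row that follows it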

;; each joined value pairs a row with the row that follows it
;; (the side that has no match comes back wrapped in an Optional)
(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)

Next you can map using a similar approach as before.

Edit

Quick note on using atoms. The problem there is the lack of referential transparency and the fact that you're leveraging incidental properties of a given implementation, not a contract. There is nothing in the map semantics that requires elements to be processed in a given order; if the internal implementation changes, it may no longer be valid. Compare, using plain Clojure:

(def a (atom 0))

(defn foo [x] (let [aa @a] (swap! a (fn [_] x)) aa))

(map foo (range 1 20))

compared to:

(def a (atom 0))
(pmap foo (range 1 20))
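With map, the calls to foo run one at a time in input order, so each call reads the value written by the previous one and the result is (0 1 2 ... 18). With pmap, the same calls run concurrently on several threads, so the read of @a and the swap! interleave and the values you get back typically differ from run to run -- exactly the kind of incidental ordering an atom-based solution ends up relying on.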


Source: https://stackoverflow.com/questions/32209413/alternative-to-mutable-data-structure-in-clojure
