Spark unionAll multiple dataframes


For a set of dataframes

val df1 = sc.parallelize(1 to 4).map(i => (i, i*10)).toDF("id", "x")
val df2 = sc.parallelize(1 to 4).map(i => (i, i*100)).toDF("id", "y")

Is there an elegant, scalable way to unionAll all of them, for any number of DataFrames?

3 Answers
  •  甜味超标
    2020-11-27 17:59

    Under the hood, Spark flattens union expressions, so the union takes longer when it is built up linearly.
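
    For illustration, with a handful of DataFrames shaped like df1 and df2 in the question (the sequence dfs below is hypothetical), the linear approach looks like this:

      // Hypothetical: several DataFrames built like df1/df2 in the question.
      val dfs = (1 to 10).map(n => sc.parallelize(1 to 4).map(i => (i, i * n)).toDF("id", "x"))

      // Chaining linearly: each union step is analyzed against an ever-deeper plan.
      val chained = dfs.reduce(_ union _)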

    The best solution would be for Spark to provide a union function that accepts multiple DataFrames.

    In the meantime, the following code may speed up the union of multiple DataFrames (or Datasets) somewhat by combining them pairwise, like a balanced binary tree, instead of one at a time.

      import scala.reflect.ClassTag
      import org.apache.spark.sql.Dataset

      // Union many Datasets pairwise, like a balanced binary tree, instead of one long linear chain.
      def union[T : ClassTag](datasets: TraversableOnce[Dataset[T]]): Dataset[T] = {
          binaryReduce[Dataset[T]](datasets, _.union(_))
      }

      // Reduce a collection by combining adjacent pairs in place, halving the number of
      // elements on each pass, so the resulting combine tree has logarithmic depth.
      def binaryReduce[T : ClassTag](ts: TraversableOnce[T], op: (T, T) => T): T = {
          if (ts.isEmpty) {
             throw new IllegalArgumentException("cannot reduce an empty collection")
          }
          val array = ts.toArray
          var size = array.length
          while (size > 1) {
             val newSize = (size + 1) / 2
             for (i <- 0 until newSize) {
                 val index  = i * 2
                 val index2 = index + 1
                 if (index2 >= size) {
                    array(i) = array(index)                   // odd element out: carry it to the next pass
                 } else {
                    array(i) = op(array(index), array(index2))
                 }
             }
             size = newSize
         }
         array(0)
     }
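
    As a rough usage sketch (df3 here is hypothetical, built the same way as df1 and df2 in the question):

      // Hypothetical third DataFrame, following the pattern from the question.
      val df3 = sc.parallelize(1 to 4).map(i => (i, i * 1000)).toDF("id", "z")

      // Same rows as df1.union(df2).union(df3), but combined pairwise.
      val all = union(Seq(df1, df2, df3))
      all.count()   // 12 rows (union keeps duplicates)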
    
