For a set of DataFrames such as

val df1 = sc.parallelize(1 to 4).map(i => (i, i * 10)).toDF("id", "x")
val df2 = sc.parallelize(1 to 4).map(i => (i, i * 100)).toDF("id", "y")
Under the hood, Spark flattens union expressions, so a union built up linearly, one DataFrame at a time, takes longer to process. The best solution would be for Spark itself to provide a union function that accepts multiple DataFrames. In the meantime, the following code can speed up the union of multiple DataFrames (or Datasets) somewhat by combining them pairwise instead.
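To see why the linear approach nests deeply, here is a small stand-in (plain Scala, no Spark required; the df1..df4 names are placeholders) where each pairwise union is rendered as a parenthesised pair. The left fold that `reduce(_ union _)` performs yields a left-leaning chain:

```scala
// Stand-in for unioning DataFrames linearly: reduce is a left fold,
// so the resulting "plan" is a deeply left-nested chain.
val names = Seq("df1", "df2", "df3", "df4")
val linear = names.reduce((l, r) => s"($l union $r)")
println(linear) // (((df1 union df2) union df3) union df4)
```

Each additional DataFrame adds one more level of nesting on the left, which is what makes the linear union increasingly expensive to analyze.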
import scala.reflect.ClassTag
import org.apache.spark.sql.Dataset

// Union any number of Datasets pairwise rather than linearly.
def union[T : ClassTag](datasets: TraversableOnce[Dataset[T]]): Dataset[T] = {
  binaryReduce[Dataset[T]](datasets, _.union(_))
}
def binaryReduce[T : ClassTag](ts: TraversableOnce[T], op: (T, T) => T): T = {
  if (ts.isEmpty) {
    throw new IllegalArgumentException("binaryReduce of empty collection")
  }
  val array = ts.toArray
  var size = array.length
  // Repeatedly combine adjacent pairs in place until one element remains.
  while (size > 1) {
    val newSize = (size + 1) / 2
    for (i <- 0 until newSize) {
      val index = i * 2
      val index2 = index + 1
      if (index2 >= size) {
        array(i) = array(index) // odd element out: carry it to the next round
      } else {
        array(i) = op(array(index), array(index2))
      }
    }
    size = newSize
  }
  array(0)
}
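As a quick sanity check of the pairing order, runnable without Spark (binaryReduce is repeated here so the snippet stands alone), combining strings into parenthesised pairs makes the balanced shape visible:

```scala
import scala.reflect.ClassTag

// Same helper as above, repeated so this check runs on its own.
def binaryReduce[T : ClassTag](ts: TraversableOnce[T], op: (T, T) => T): T = {
  if (ts.isEmpty) throw new IllegalArgumentException
  val array = ts.toArray
  var size = array.length
  while (size > 1) {
    val newSize = (size + 1) / 2
    for (i <- 0 until newSize) {
      val index = i * 2
      if (index + 1 >= size) array(i) = array(index) // carry the odd one over
      else array(i) = op(array(index), array(index + 1))
    }
    size = newSize
  }
  array(0)
}

val shape = binaryReduce(Seq("a", "b", "c", "d", "e"),
  (l: String, r: String) => s"($l $r)")
println(shape) // (((a b) (c d)) e) -- balanced pairs, not a left-leaning chain
```

With Datasets the call looks the same, e.g. union(Seq(ds1, ds2, ds3, ds4)), and each round of the loop halves the number of plans Spark has to stitch together.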