Question
Scala 2.11.8, spark 2.0.1
Scala 2.11.8, Spark 2.0.1
The explode function is very slow, so I am looking for an alternative. I think it should be possible with RDDs and flatMap; help is greatly appreciated.
I have a UDF that returns a List[(String, String, String, Int)] of varying length. For each row in the DataFrame, I want to create multiple rows and multiple columns.
def Udf = udf { (s: String) =>
  if (s == "1") Seq(("a", "b", "c", 0), ("a1", "b1", "c1", 1), ("a2", "b2", "c2", 2)).toList
  else Seq(("a", "b", "c", 0)).toList
}
val df = Seq(("a", "1"), ("b", "2")).toDF("A", "B")
val df1 = df.withColumn("C", Udf($"B"))
val df2 = df1.select($"A", explode($"C"))
val df3 = df2.withColumn("D", $"col._1").withColumn("E", $"col._2").withColumn("F", $"col._3").withColumn("G", $"col._4")
// DataFrame after applying the UDF:
+---+---+--------------------+
| A| B| C|
+---+---+--------------------+
| a| 1|[[a,b,c,0], [a1,b...|
| b| 2| [[a,b,c,0]]|
+---+---+--------------------+
// Final DataFrame:
+---+------------+---+---+---+---+
| A| col| D| E| F| G|
+---+------------+---+---+---+---+
| a| [a,b,c,0]| a| b| c| 0|
| a|[a1,b1,c1,1]| a1| b1| c1| 1|
| a|[a2,b2,c2,2]| a2| b2| c2| 2|
| b| [a,b,c,0]| a| b| c| 0|
+---+------------+---+---+---+---+
This is very slow on many millions of rows. Takes over 12 hours.
Answer 1:
Here is a simple example of an alternative way of exploding arrays, using flatMap:
val ds = sc.parallelize(Seq((0, "Lorem ipsum dolor", 1.0, Array("prp1", "prp2", "prp3"))))
ds.flatMap { t =>
  t._4.map { prp =>
    (t._1, t._2, t._3, prp)
  }
}.collect.foreach(println)
Result:
(0,Lorem ipsum dolor,1.0,prp1)
(0,Lorem ipsum dolor,1.0,prp2)
(0,Lorem ipsum dolor,1.0,prp3)
I tried it with your dataset, but I'm not sure it's the optimal way of doing it.
df1.show(false)
+---+---+------------------------------------------------+
|A |B |C |
+---+---+------------------------------------------------+
|a |1 |[[a, b, c, 0], [a1, b1, c1, 1], [a2, b2, c2, 2]]|
|b |2 |[[a, b, c, 0]] |
+---+---+------------------------------------------------+
df1.rdd.flatMap { t: Row =>
  t.getSeq[Row](2).map { row: Row =>
    (t.getString(0), t.getString(1), row)
  }
}.map {
  case (col1: String, col2: String, col3: Row) =>
    (col1, col2, col3.getString(0), col3.getString(1), col3.getString(2), col3.getInt(3))
}.collect.foreach(println)
Result:
(a,1,a,b,c,0)
(a,1,a1,b1,c1,1)
(a,1,a2,b2,c2,2)
(b,2,a,b,c,0)
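The fan-out itself is ordinary collection flatMap logic, so it can be sanity-checked without a SparkSession. A minimal plain-Scala sketch of the same expansion, using the example values above (a Seq stands in for the RDD):

```scala
// Each input row carries a nested sequence of tuples; flatMap fans it out
// so that every nested tuple becomes its own flat output row.
val rows = Seq(
  ("a", "1", Seq(("a", "b", "c", 0), ("a1", "b1", "c1", 1), ("a2", "b2", "c2", 2))),
  ("b", "2", Seq(("a", "b", "c", 0)))
)

val exploded = rows.flatMap { case (a, b, nested) =>
  nested.map { case (d, e, f, g) => (a, b, d, e, f, g) }
}

exploded.foreach(println)
// (a,1,a,b,c,0)
// (a,1,a1,b1,c1,1)
// (a,1,a2,b2,c2,2)
// (b,2,a,b,c,0)
```

On a real RDD the same lambda applies unchanged; only the receiver differs.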
Hope this helps!!
Answer 2:
You can:
- Update Spark to version 2.3 or later, where SPARK-21657 should be fixed.
- Replace your code with a flatMap over a typed Dataset:
df.as[(String, String)].flatMap {
  case (a, "1") => Seq(
    (a, "a", "b", "c", 0),
    (a, "a1", "b1", "c1", 1),
    (a, "a2", "b2", "c2", 2))
  case (a, _) => Seq((a, "a", "b", "c", 0))
}
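This flatMap inlines the original UDF's branching, so the branch logic can likewise be exercised in plain Scala without Spark. A sketch on an ordinary Seq standing in for the Dataset:

```scala
// Same pattern match as the Dataset flatMap above, run on a plain Seq:
// rows whose second field is "1" fan out to three tuples, all others to one.
val input = Seq(("a", "1"), ("b", "2"))

val result = input.flatMap {
  case (a, "1") => Seq(
    (a, "a", "b", "c", 0),
    (a, "a1", "b1", "c1", 1),
    (a, "a2", "b2", "c2", 2))
  case (a, _) => Seq((a, "a", "b", "c", 0))
}

result.foreach(println)
// (a,a,b,c,0)
// (a,a1,b1,c1,1)
// (a,a2,b2,c2,2)
// (b,a,b,c,0)
```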
Source: https://stackoverflow.com/questions/50376257/scala-spark-dataframe-explode-is-slow-so-alternate-method-create-columns-an