Question
Scala 2.11.8, spark 2.0.1
Scala 2.11.8, Spark 2.0.1
The explode function is very slow, so I am looking for an alternative. I think it should be possible with RDDs and flatMap; help is greatly appreciated.
I have a UDF that returns a List[(String, String, String, Int)] of varying length. For each row in the DataFrame, I want to create multiple rows and multiple columns.
def Udf = udf { (s: String) =>
  if (s == "1") Seq(("a", "b", "c", 0), ("a1", "b1", "c1", 1), ("a2", "b2", "c2", 2)).toList
  else Seq(("a", "b", "c", 0)).toList
}
val df = Seq(("a", "1"), ("b", "2")).toDF("A", "B")
val df1 = df.withColumn("C", Udf($"B"))
val df2 = df1.select($"A", explode($"C"))
val df3 = df2.withColumn("D", $"col._1").withColumn("E", $"col._2").withColumn("F", $"col._3").withColumn("G", $"col._4")
// DataFrame after applying the UDF:
+---+---+--------------------+
| A| B| C|
+---+---+--------------------+
| a| 1|[[a,b,c,0], [a1,b...|
| b| 2| [[a,b,c,0]]|
+---+---+--------------------+
// Final DataFrame:
+---+------------+---+---+---+---+
| A| col| D| E| F| G|
+---+------------+---+---+---+---+
| a| [a,b,c,0]| a| b| c| 0|
| a|[a1,b1,c1,1]| a1| b1| c1| 1|
| a|[a2,b2,c2,2]| a2| b2| c2| 2|
| b| [a,b,c,0]| a| b| c| 0|
+---+------------+---+---+---+---+
This is very slow on many millions of rows. Takes over 12 hours.
Answer 1:
Here is a simple example of an alternative way of exploding arrays, using flatMap:
val ds = sc.parallelize(Seq((0, "Lorem ipsum dolor", 1.0, Array("prp1", "prp2", "prp3"))))
ds.flatMap { t =>
  t._4.map { prp =>
    (t._1, t._2, t._3, prp)
  }
}.collect.foreach(println)
Result:
(0,Lorem ipsum dolor,1.0,prp1)
(0,Lorem ipsum dolor,1.0,prp2)
(0,Lorem ipsum dolor,1.0,prp3)
I tried it with your dataset, but I'm not sure it's the optimal way of doing it.
df1.show(false)
+---+---+------------------------------------------------+
|A |B |C |
+---+---+------------------------------------------------+
|a |1 |[[a, b, c, 0], [a1, b1, c1, 1], [a2, b2, c2, 2]]|
|b |2 |[[a, b, c, 0]] |
+---+---+------------------------------------------------+
df1.rdd.flatMap { t: Row =>
  t.getSeq[Row](2).map { row: Row =>
    (t.getString(0), t.getString(1), row)
  }
}.map {
  case (col1: String, col2: String, col3: Row) =>
    (col1, col2, col3.getString(0), col3.getString(1), col3.getString(2), col3.getInt(3))
}.collect.foreach(println)
Result:
(a,1,a,b,c,0)
(a,1,a1,b1,c1,1)
(a,1,a2,b2,c2,2)
(b,2,a,b,c,0)
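The fan-out itself is ordinary collection flatMap logic, so it can be sanity-checked without a SparkSession. A minimal plain-Scala sketch of the same expansion, using the example values above (a Seq stands in for the RDD):

```scala
// Each input row carries a nested sequence of tuples; flatMap fans it out
// so that every nested tuple becomes its own flat output row.
val rows = Seq(
  ("a", "1", Seq(("a", "b", "c", 0), ("a1", "b1", "c1", 1), ("a2", "b2", "c2", 2))),
  ("b", "2", Seq(("a", "b", "c", 0)))
)

val exploded = rows.flatMap { case (a, b, nested) =>
  nested.map { case (d, e, f, g) => (a, b, d, e, f, g) }
}

exploded.foreach(println)
// (a,1,a,b,c,0)
// (a,1,a1,b1,c1,1)
// (a,1,a2,b2,c2,2)
// (b,2,a,b,c,0)
```

On a real RDD the same lambda applies unchanged; only the receiver differs.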
Hope this helps!!
Answer 2:
You can:
- Update Spark to version 2.3 or later, where SPARK-21657 should be fixed.
- Replace your code with a flatMap over a typed Dataset:
df.as[(String, String)].flatMap {
  case (a, "1") => Seq(
    (a, "a", "b", "c", 0),
    (a, "a1", "b1", "c1", 1),
    (a, "a2", "b2", "c2", 2))
  case (a, _) => Seq((a, "a", "b", "c", 0))
}
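This flatMap inlines the original UDF's branching, so the branch logic can likewise be exercised in plain Scala without Spark. A sketch on an ordinary Seq standing in for the Dataset:

```scala
// Same pattern match as the Dataset flatMap above, run on a plain Seq:
// rows whose second field is "1" fan out to three tuples, all others to one.
val input = Seq(("a", "1"), ("b", "2"))

val result = input.flatMap {
  case (a, "1") => Seq(
    (a, "a", "b", "c", 0),
    (a, "a1", "b1", "c1", 1),
    (a, "a2", "b2", "c2", 2))
  case (a, _) => Seq((a, "a", "b", "c", 0))
}

result.foreach(println)
// (a,a,b,c,0)
// (a,a1,b1,c1,1)
// (a,a2,b2,c2,2)
// (b,a,b,c,0)
```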
Source: https://stackoverflow.com/questions/50376257/scala-spark-dataframe-explode-is-slow-so-alternate-method-create-columns-an