Performance impact of RDD API vs UDFs mixed with DataFrame API

Happy的楠姐 2020-12-29 15:33

(Scala-specific question.)

While Spark docs encourage the use of DataFrame API where possible, if DataFrame API is insufficient, the choice is usually between fallin

1 answer
  • 2020-12-29 16:16

    neither of them can benefit from Catalyst and Tungsten optimizations

    This is not exactly true. While UDFs don't benefit from Tungsten optimizations (arguably, simple SQL transformations don't get a huge boost there either), you may still benefit from the execution plan optimizations provided by Catalyst. Let's illustrate that with a simple example (note: Spark 2.0 and Scala; don't extrapolate this to earlier versions, especially with PySpark):

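    import org.apache.spark.sql.functions.{sum, udf}
    import spark.implicits._  // for toDF and the $"..." syntax; assumes a SparkSession named spark

    val f = udf((x: String) => x == "a")
    val g = udf((x: Int) => x + 1)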
    
    val df = Seq(("a", 1), ("b", 2)).toDF
    
    df
      .groupBy($"_1")
      .agg(sum($"_2").as("_2"))
      .where(f($"_1"))
      .withColumn("_2", g($"_2"))
      .select($"_1")
      .explain
    
    // == Physical Plan ==
    // *HashAggregate(keys=[_1#2], functions=[])
    // +- Exchange hashpartitioning(_1#2, 200)
    //    +- *HashAggregate(keys=[_1#2], functions=[])
    //       +- *Project [_1#2]
    //          +- *Filter UDF(_1#2)
    //             +- LocalTableScan [_1#2, _2#3]
    

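    The execution plan shows a couple of things:

    • The selection has been pushed down below the aggregation: Filter UDF(_1#2) runs directly on the scan, before any shuffle.
    • The projection has been pushed down below the aggregation as well, which effectively removed the second UDF call (g). The sum is gone too (functions=[]), because its result is never selected.

    Depending on the data and the pipeline, this can provide a substantial performance boost almost for free.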
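
    To see what Catalyst rewrote rather than only the final physical plan, you can ask for the extended output. The following is a minimal sketch, assuming Spark 2.x or later and a local SparkSession; the object name and app name are made up for illustration. It uses a Long-typed UDF for g because sum over an integer column produces a Long.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, udf}

object ExplainExtended {
  // Plain Scala functions wrapped as UDFs below.
  val isA: String => Boolean = _ == "a"
  val plusOne: Long => Long = _ + 1

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-extended") // illustrative name
      .getOrCreate()
    import spark.implicits._

    val f = udf(isA)
    val g = udf(plusOne)

    val result = Seq(("a", 1), ("b", 2)).toDF
      .groupBy($"_1")
      .agg(sum($"_2").as("_2"))
      .where(f($"_1"))
      .withColumn("_2", g($"_2"))
      .select($"_1")

    // Prints the parsed, analyzed and optimized logical plans
    // followed by the physical plan, so the rewrites can be compared.
    result.explain(true)

    spark.stop()
  }
}
```

    Comparing the analyzed and optimized logical plans side by side makes it easier to check whether a given filter or projection was actually moved on your Spark version.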

    That being said, both RDDs and UDFs require converting data between the safe and unsafe representations (ordinary JVM objects versus Tungsten's binary row format), with UDFs being significantly less flexible. Still, if all you need is simple map-like behavior without initializing expensive objects (like database connections), then a UDF is the way to go.
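
    To make the distinction concrete, here is a minimal sketch, assuming Spark 2.x or later with a local SparkSession. ExpensiveClient is a hypothetical stand-in for something like a database connection, not a real API. A UDF handles the stateless per-row case, while mapPartitions on a typed Dataset lets you build the expensive object once per partition instead of once per row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical stand-in for a costly resource such as a database connection.
class ExpensiveClient {
  def lookup(key: String): String = key.toUpperCase
  def close(): Unit = ()
}

object UdfVsMapPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-vs-mappartitions") // illustrative name
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF

    // Simple, stateless map-like logic: a UDF keeps you in the DataFrame API.
    val plusOne = udf((x: Int) => x + 1)
    df.withColumn("_2", plusOne($"_2")).show()

    // Expensive setup: create the client once per partition, not once per row.
    val enriched = df.as[(String, Int)].mapPartitions { rows =>
      val client = new ExpensiveClient
      // Materialize before closing, because the iterator is lazy.
      val out = rows.map { case (k, v) => (client.lookup(k), v) }.toList
      client.close()
      out.iterator
    }
    enriched.show()

    spark.stop()
  }
}
```

    Materializing each partition with toList keeps the sketch short but holds the whole partition in memory; for large partitions you would wrap the iterator and close the client when it is exhausted.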

    In slightly more complex scenarios, you can easily drop down to a generic Dataset and reserve RDDs for cases where you really need access to low-level features such as custom partitioning.
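
    As a rough illustration of those two levels, the sketch below (again assuming Spark 2.x or later and a local SparkSession) first applies arbitrary Scala logic through the typed Dataset API, then drops to an RDD only for partitionBy with a custom Partitioner. Record and FirstLetterPartitioner are hypothetical names introduced for this example.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession

// Illustrative record type; the field names are assumptions for this sketch.
case class Record(key: String, value: Int)

// Hypothetical partitioner: keys starting with "a" go to partition 0, all others to 1.
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.startsWith("a")) 0 else 1
}

object DatasetAndRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-and-rdd") // illustrative name
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Record("a", 1), Record("b", 2)).toDS()

    // Typed Dataset: arbitrary Scala lambdas without leaving the Dataset API.
    val bumped = ds.filter(_.key == "a").map(r => r.copy(value = r.value + 1))
    bumped.show()

    // RDD: used here only because partitionBy with a custom Partitioner
    // is defined on pair RDDs.
    val partitioned = ds.rdd
      .map(r => (r.key, r.value))
      .partitionBy(new FirstLetterPartitioner)
    println(partitioned.getNumPartitions)

    spark.stop()
  }
}
```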
