Spark UDF called more than once per record when DF has too many columns

半阙折子戏 2020-12-10 02:38

I'm using Spark 1.6.1 and encountering strange behaviour: I'm running a UDF with some heavy computations (a physics simulation) on a dataframe containing some input da

3 answers
  •  暖寄归人
    2020-12-10 03:35

    In newer Spark versions (2.3+), we can mark UDFs as non-deterministic: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/UserDefinedFunction.html#asNondeterministic():org.apache.spark.sql.expressions.UserDefinedFunction

    i.e. use

    val myUdf = udf(...).asNondeterministic()
    

    This ensures the UDF is called only once per record.
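
As a rough illustration of the idea, here is a minimal sketch, assuming Spark 2.3+ running in local mode. The squaring function is only a hypothetical stand-in for the expensive simulation, and the column names and the accumulator used to count invocations are likewise made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object NondeterministicUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nondeterministic-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Accumulator that counts how often the UDF body runs (illustration only).
    val calls = spark.sparkContext.longAccumulator("udfCalls")

    // Hypothetical stand-in for the heavy computation; marked non-deterministic
    // so the optimizer does not duplicate the expression.
    val simulate = udf((x: Double) => {
      calls.add(1)
      x * x
    }).asNondeterministic()

    val df = Seq(1.0, 2.0, 3.0).toDF("input")

    // Several columns derived from the same UDF result.
    val result = df
      .withColumn("sim", simulate(col("input")))
      .withColumn("simPlusOne", col("sim") + 1)
      .withColumn("simTimesTwo", col("sim") * 2)

    result.explain(true) // inspect where the UDF appears in the plans
    result.collect()     // trigger execution
    println(s"UDF invocations: ${calls.value} for ${df.count()} input rows")

    spark.stop()
  }
}
```

Comparing the printed invocation count and the optimized plan with and without `.asNondeterministic()` is one way to check whether the UDF is being evaluated more than once per record in a given Spark version.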
