Spark UDF called more than once per record when DF has too many columns

半阙折子戏 · Asked 2020-12-10 02:38

I'm using Spark 1.6.1 and encountering strange behaviour: I'm running a UDF that performs some heavy computation (a physics simulation) on a dataframe containing some input data.

3 Answers
  • 2020-12-10 03:30

    I can't really explain this behavior - but obviously the query plan somehow chooses a path where some of the records are calculated twice. This means that if we cache the intermediate result (right after applying the UDF) we might be able to "force" Spark not to recompute the UDF. And indeed, once caching is added it behaves as expected - UDF is called exactly 100 times:

    // get results of UDF
    val results = data
      .withColumn("tmp", myUdf($"id"))
      .withColumn("result", $"tmp.a")
      .cache()
    

    Of course, caching has its own costs (memory...), but it might end up beneficial in your case if it saves many UDF calls.
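The effect of `.cache()` can be illustrated without Spark, using plain Scala lazy views: a view re-runs its mapping function on every traversal (like an un-cached DataFrame column that several downstream operators consume), while materializing it once into a strict collection plays the role of caching. The names `CacheSketch`, `heavy`, and `simulate` are invented for this sketch.

```scala
object CacheSketch {
  // Returns (UDF calls without caching, UDF calls with caching)
  // for n records, each consumed by two downstream operations.
  def simulate(n: Int): (Int, Int) = {
    var calls = 0
    def heavy(x: Int): Int = { calls += 1; x * x } // stands in for the expensive UDF

    val ids = (1 to n).toVector

    // Lazy view: like an un-cached column, heavy() re-runs per traversal.
    val lazyResults = ids.view.map(heavy)
    val total   = lazyResults.sum // first traversal: n calls
    val biggest = lazyResults.max // second traversal: n more calls
    val uncached = calls

    // Materialize once -- the analogue of .cache(): downstream
    // consumers reuse the stored values instead of recomputing.
    calls = 0
    val cachedResults = lazyResults.toVector // one traversal: n calls
    val total2   = cachedResults.sum // no further calls
    val biggest2 = cachedResults.max // no further calls
    (uncached, calls)
  }

  def main(args: Array[String]): Unit =
    println(simulate(100)) // prints (200,100)
}
```

With 100 records and two consumers, the un-cached pipeline pays 200 calls and the materialized one 100, mirroring the doubled UDF invocations described above.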

  • 2020-12-10 03:35

    In newer Spark versions (2.3+) we can mark UDFs as non-deterministic: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/UserDefinedFunction.html#asNondeterministic():org.apache.spark.sql.expressions.UserDefinedFunction

    i.e. use

    val myUdf = udf(...).asNondeterministic()
    

    This makes sure the UDF is called only once per record.
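A toy model of why the flag helps: an optimizer is free to duplicate an expression it believes is deterministic into every projection that references it, but a nondeterministic expression must be evaluated once and its result shared. The "planner" below is invented purely to illustrate that contract; it is not how Catalyst is implemented.

```scala
object NondeterministicSketch {
  private var calls = 0
  private def udf(x: Int): Int = { calls += 1; x + 1 } // the "expensive" function

  // Toy planner: a deterministic expression may be inlined into each of the
  // two output columns (evaluated twice per record); a nondeterministic one
  // is evaluated once per record and the result reused.
  private def plan(deterministic: Boolean, records: Seq[Int]): Seq[(Int, Int)] =
    if (deterministic) records.map(r => (udf(r), udf(r) * 10))
    else records.map { r => val tmp = udf(r); (tmp, tmp * 10) }

  // Count how many times the function runs under each planning mode.
  def callsFor(deterministic: Boolean): Int = {
    calls = 0
    plan(deterministic, 1 to 100)
    calls
  }

  def main(args: Array[String]): Unit = {
    println(callsFor(deterministic = true))  // 200
    println(callsFor(deterministic = false)) // 100
  }
}
```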

  • 2020-12-10 03:37

    We had this same problem about a year ago and spent a lot of time before we finally figured out what the problem was.

    We also had a very expensive UDF to calculate, and we found that it was recalculated every time we referred to its column. It just happened to us again a few days ago, so I decided to open a bug about it: SPARK-18748

    We also opened a question here at the time, but now I see the title wasn't very good: Trying to turn a blob into multiple columns in Spark

    I agree with Tzach about somehow "forcing" the plan to calculate the UDF only once. Our approach was uglier, but we had no choice: we couldn't cache() the data because it was too big:

    val df = data.withColumn("tmp", myUdf($"id"))
    val results = sqlContext.createDataFrame(df.rdd, df.schema)
      .withColumn("result", $"tmp.a")
    

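Re-creating the DataFrame from `df.rdd` works because it severs the lineage: downstream projections of `$"tmp.a"` see already-materialized rows rather than a logical plan still containing the UDF, so there is nothing left to re-inline. A rough stdlib analogy (all names invented for this sketch): projecting two fields of a struct from a lazy pipeline re-runs the producer per projection, while forcing the pipeline into a strict collection first, and projecting from that, runs it once.

```scala
object LineageCutSketch {
  // Returns (UDF calls projecting through the lazy "lineage",
  //          UDF calls after materializing first) for n records.
  def simulate(n: Int): (Int, Int) = {
    var calls = 0
    // Stands in for the UDF producing the "tmp" struct with fields a and b.
    def myUdf(x: Int): (Int, Int) = { calls += 1; (x, x * x) }

    val data = (1 to n).view

    // Projecting each struct field directly from the lazy pipeline:
    // every projection walks the pipeline and re-runs the UDF.
    val tmp = data.map(myUdf)
    val colA = tmp.map(_._1).toVector // n calls
    val colB = tmp.map(_._2).toVector // n more calls
    val withLineage = calls

    // "Cutting the lineage": force the UDF output into a strict collection
    // first (like createDataFrame(df.rdd, df.schema)), then project fields
    // from the materialized rows.
    calls = 0
    val materialized = data.map(myUdf).toVector // n calls, and no more after
    val colA2 = materialized.map(_._1)
    val colB2 = materialized.map(_._2)
    (withLineage, calls)
  }

  def main(args: Array[String]): Unit =
    println(simulate(100)) // prints (200,100)
}
```

Unlike the cache() variant, the materialized rows here are just the regular output of the stage, so the technique trades no extra storage beyond what the job already produces, which is why it suited data too big to cache.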
    Update:

    Now I see that my JIRA ticket was linked to another one, SPARK-17728, which still doesn't really handle this issue the right way, but it offers one more optional workaround:

    val results = data.withColumn("tmp", explode(array(myUdf($"id"))))
                      .withColumn("result", $"tmp.a")
    