Spark UDF called more than once per record when DF has too many columns

半阙折子戏 · 2020-12-10 02:38

I'm using Spark 1.6.1 and encountering strange behaviour: I'm running a UDF with some heavy computations (a physics simulation) on a dataframe containing some input data.

3 Answers
  •  Happy的楠姐
    2020-12-10 03:30

    I can't really explain this behavior - but obviously the query plan somehow chooses a path where some of the records are calculated twice. This means that if we cache the intermediate result (right after applying the UDF) we might be able to "force" Spark not to recompute the UDF. And indeed, once caching is added it behaves as expected - UDF is called exactly 100 times:

    // get results of UDF, then cache so the plan doesn't recompute it
    val results = data
      .withColumn("tmp", myUdf($"id"))
      .withColumn("result", $"tmp.a")
      .cache()
    

    Of course, caching has its own costs (memory...), but it might end up beneficial in your case if it saves many UDF calls.
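    The mechanism at work here can be sketched without Spark at all: Spark plans are lazily evaluated, so two branches that both reference an uncached intermediate column may each re-run the UDF, while `.cache()` materializes the intermediate result once. The following is a minimal plain-Scala analogy, not Spark code; `heavy` stands in for the UDF, a lazy `view` for the uncached plan, and a strict `map` for the cached one. All names are illustrative.

    ```scala
    object UdfRecomputeSketch {
      // Returns (calls without caching, calls with caching) for 100 records.
      def run(): (Int, Int) = {
        var calls = 0
        def heavy(x: Int): Int = { calls += 1; x * 2 } // stand-in for the UDF

        val data = 1 to 100

        // Lazy pipeline: like an uncached plan, each downstream branch
        // re-evaluates heavy() for every record.
        val lazyTmp = data.view.map(heavy)
        lazyTmp.map(_ + 1).sum // first branch: 100 calls
        lazyTmp.map(_ - 1).sum // second branch: 100 more calls
        val uncachedCalls = calls // 200: twice per record

        calls = 0
        // Strict map materializes the intermediate result once,
        // analogous to calling .cache() right after applying the UDF.
        val cachedTmp = data.map(heavy)
        cachedTmp.map(_ + 1).sum
        cachedTmp.map(_ - 1).sum
        val cachedCalls = calls // 100: exactly once per record

        (uncachedCalls, cachedCalls)
      }

      def main(args: Array[String]): Unit = {
        val (u, c) = run()
        println(s"uncached: $u, cached: $c")
      }
    }
    ```

    The same logic explains why the accepted fix works: caching cuts the plan at the UDF, so later references to `$"tmp.a"` read the materialized column instead of re-triggering the computation.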
