I have a recursive Spark algorithm that applies a sliding window of 10 days to a Dataset.
The original dataset is loaded from a Hive table partitioned by date.
Checkpointing and converting back to an RDD are indeed the best/only ways to truncate lineage.
Many (perhaps all) of the Spark ML algorithms that expose Dataset/DataFrame APIs are actually implemented with RDDs under the hood; only the API surface is DS/DF. This is because the Catalyst optimizer is not parallelized, and the query plan (lineage) grows unmanageably large under iterative/recursive implementations.
There is a cost to converting to and from an RDD, but it is smaller than that of the file-system checkpointing option.