Spark iterative/recursive algorithms - Breaking spark lineage

故里飘歌 2020-12-18 17:04

I have a recursive Spark algorithm that applies a 10-day sliding window to a Dataset.

The original dataset is loaded from a Hive table partitioned by date.
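To make the setup concrete, here is a rough sketch of the kind of loop I mean; the table and column names (events, event_date, key, score) and the list of window start dates are placeholders, not the real job:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// illustrative window start dates; the real job derives them from the data
val windowStarts = Seq("2020-01-01", "2020-01-11", "2020-01-21")

// Hive table partitioned by date (placeholder name/columns)
var acc: DataFrame = spark.read.table("events")

for (start <- windowStarts) {
  val windowed = acc
    .filter(col("event_date").between(lit(start), date_add(lit(start), 10)))
    .groupBy("key")
    .agg(sum("score").as(s"score_$start"))

  // every pass stacks a filter/aggregate/join on top of the previous plan,
  // so the logical plan (and its lineage) grows with each iteration
  acc = acc.join(windowed, Seq("key"), "left")
}
```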

1 Answer
  • 2020-12-18 17:47

    Checkpointing and converting back to an RDD are indeed the best/only ways to truncate lineage.
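    As a rough sketch of the checkpointing route, assuming a SparkSession named spark, the DataFrame acc built up inside your loop, and a checkpoint directory of your choosing:

    ```scala
    // one-time setup: where checkpointed data is written (path is an assumption)
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    // inside the iterative loop: checkpoint() is eager by default, so it
    // materializes acc to the checkpoint dir and returns a Dataset whose
    // plan starts from those files instead of the whole history so far
    acc = acc.checkpoint()

    // localCheckpoint() (Spark 2.3+) keeps the blocks on the executors
    // instead of a reliable store: faster, but lost if an executor dies
    // acc = acc.localCheckpoint()
    ```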

    Many (all?) of the Spark ML algorithms expose Dataset/DataFrame APIs but are actually implemented on RDDs underneath, precisely because the optimizer is not parallelized and the lineage produced by an iterative/recursive implementation keeps growing with every pass.

    There is a cost to converting to and from an RDD, but it is smaller than the cost of the file-system checkpointing option.
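    A minimal sketch of that round trip, assuming only a DataFrame in hand (the helper name truncateLineage is made up):

    ```scala
    import org.apache.spark.sql.DataFrame

    // Rebuild the DataFrame on top of its own RDD. The new logical plan
    // starts at this RDD, so analysis/optimization time stops growing with
    // each iteration; the RDD itself still carries the execution DAG.
    def truncateLineage(df: DataFrame): DataFrame =
      df.sparkSession.createDataFrame(df.rdd, df.schema)

    // inside the loop, after each window/iteration:
    acc = truncateLineage(acc)
    ```

    The round trip deserializes rows and builds them back up, which is the cost mentioned above, but nothing is written to disk.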
