What is the difference between Spark checkpoint and persist to disk? Are both of these stored on the local disk?
persist(StorageLevel.MEMORY_AND_DISK) stores the data frame in memory, spilling to disk when needed, without breaking the lineage of the program, i.e. df.rdd.toDebugString() returns the same output. It is recommended to call persist() on a result that is going to be reused, to avoid recalculating intermediate results:
from pyspark import StorageLevel

df = df.persist(StorageLevel.MEMORY_AND_DISK)
calculation1(df)
calculation2(df)
Note that caching the data frame does not guarantee that it will still be in memory the next time you use it; depending on memory pressure, cached blocks can be evicted and will then be recomputed from the lineage.
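When the reused result is no longer needed, you can release it explicitly instead of waiting for eviction. A minimal sketch, continuing the example above (the name df is carried over from that snippet):

print(df.is_cached)   # whether the data frame is currently marked as cached
df.unpersist()        # explicitly drop the cached blocks from memory and disk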
checkpoint(), on the other hand, breaks the lineage and forces the data frame to be stored on disk. Unlike cache()/persist(), frequent checkpointing can slow your program down. Checkpoints are recommended when a) working in an unstable environment, to allow fast recovery from failures, or b) storing intermediate states of a calculation in which new entries of the RDD depend on previous entries, i.e. to avoid recomputing a long dependency chain in case of failure. As for the second part of the question: persist(MEMORY_AND_DISK) spills to the executors' local storage (spark.local.dir), while checkpoint() writes to the directory set with sc.setCheckpointDir(), which in a cluster is typically a fault-tolerant file system such as HDFS rather than the local disk.
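A minimal sketch of checkpointing a data frame with a long dependency chain; the checkpoint directory, column names, and the loop are just illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-example").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # example path; usually HDFS/S3 on a cluster

df = spark.range(1000000)
for i in range(10):                                # build up a long dependency chain
    df = df.withColumn("col_%d" % i, df["id"] * i)

df = df.checkpoint()                               # materializes the data and truncates the lineage
print(df.rdd.toDebugString())                      # lineage now starts from the checkpointed data

Because checkpoint() triggers a job to write the data out, it is usually placed after the expensive part of the computation, so the truncated lineage pays for the extra write.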