Where is df.cache() stored?

清歌不尽 2021-02-03 15:35

I would like to understand on which node (driver or worker/executor) the data from the code below is stored:

df.cache() // df is a large DataFrame (200 GB)

And

4 Answers
  •  没有蜡笔的小新
    2021-02-03 16:18

    df.cache() calls the persist() method, which stores the data with the default storage level MEMORY_AND_DISK, but you can change the storage level. The cached partitions themselves live on the executors (worker nodes), in their memory and/or local disk, not on the driver.
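
    For example, a minimal sketch (assuming a SparkSession named spark and a hypothetical Parquet path, neither of which comes from the question) that checks the level reported after cache():

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-demo").getOrCreate()
    val df = spark.read.parquet("/data/large_table")   // hypothetical large input

    df.cache()                 // registers the plan with the CacheManager; nothing is stored yet
    println(df.storageLevel)   // default for DataFrames: MEMORY_AND_DISK
    df.count()                 // the first action materializes the cached partitions on the executors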

    The persist() method calls sparkSession.sharedState.cacheManager.cacheQuery(), and if you look at the code for cacheTable, it also calls the same sparkSession.sharedState.cacheManager.cacheQuery().

    That means both behave the same and are lazily evaluated (the data is only materialized once an action is performed), except that persist() lets you store at the storage level you provide (see the sketch after this list). These are the available storage levels:

    • NONE
    • DISK_ONLY
    • DISK_ONLY_2
    • MEMORY_ONLY
    • MEMORY_ONLY_2
    • MEMORY_ONLY_SER
    • MEMORY_ONLY_SER_2
    • MEMORY_AND_DISK
    • MEMORY_AND_DISK_2
    • MEMORY_AND_DISK_SER
    • MEMORY_AND_DISK_SER_2
    • OFF_HEAP
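
    For illustration, a minimal sketch (using the same hypothetical df as above) of persist() with an explicit level; MEMORY_ONLY_SER here is just an example choice:

    import org.apache.spark.storage.StorageLevel

    df.persist(StorageLevel.MEMORY_ONLY_SER)   // keep partitions serialized in executor memory only
    df.count()                                 // an action triggers the actual caching
    df.unpersist()                             // later, release the cached blocks on the executors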

    You can also use SQL's CACHE TABLE, which is not lazily evaluated: it scans and stores the whole table in memory right away, which may also lead to OOM errors.
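
    A minimal sketch of the SQL variants, assuming the DataFrame has been registered as a temporary view named big_table (a hypothetical name):

    df.createOrReplaceTempView("big_table")

    spark.sql("CACHE TABLE big_table")        // eager: scans and caches the whole table immediately
    // or, for the same lazy behaviour as cache()/persist():
    // spark.sql("CACHE LAZY TABLE big_table")

    spark.sql("UNCACHE TABLE big_table")      // releases the cached data when it is no longer needed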

    Summary: cache(), persist(), and cacheTable() are lazily evaluated and need an action to materialize the data, whereas SQL CACHE TABLE is eager.

    See here for details!

    You can choose whichever suits your requirements!

    Hope this helps!
