Where is df.cache() stored?

清歌不尽 2021-02-03 15:35

I would like to understand on which node (driver or worker/executor) the data from the code below is stored:

df.cache() // df is a large DataFrame (200 GB)

And

4 Answers
  •  没有蜡笔的小新
    2021-02-03 16:18

    df.cache() calls the persist() method, which stores the data with the default storage level MEMORY_AND_DISK, but you can change the storage level. The cached partitions themselves live on the executors (worker nodes), in their memory and/or local disk, not on the driver.
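
    For example, a minimal sketch (assuming a SparkSession named spark and a hypothetical Parquet path, neither of which comes from the question) that checks the level reported after cache():

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-demo").getOrCreate()
    val df = spark.read.parquet("/data/large_table")   // hypothetical large input

    df.cache()                 // registers the plan with the CacheManager; nothing is stored yet
    println(df.storageLevel)   // default for DataFrames: MEMORY_AND_DISK
    df.count()                 // the first action materializes the cached partitions on the executors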

    The persist() method calls sparkSession.sharedState.cacheManager.cacheQuery(), and if you look at the code for cacheTable, it also calls the same sparkSession.sharedState.cacheManager.cacheQuery().

    That means both behave the same and are lazily evaluated (the data is only materialized once an action is performed), except that persist() lets you store at the storage level you provide (see the sketch after this list). These are the available storage levels:

    • NONE
    • DISK_ONLY
    • DISK_ONLY_2
    • MEMORY_ONLY
    • MEMORY_ONLY_2
    • MEMORY_ONLY_SER
    • MEMORY_ONLY_SER_2
    • MEMORY_AND_DISK
    • MEMORY_AND_DISK_2
    • MEMORY_AND_DISK_SER
    • MEMORY_AND_DISK_SER_2
    • OFF_HEAP
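
    For illustration, a minimal sketch (using the same hypothetical df as above) of persist() with an explicit level; MEMORY_ONLY_SER here is just an example choice:

    import org.apache.spark.storage.StorageLevel

    df.persist(StorageLevel.MEMORY_ONLY_SER)   // keep partitions serialized in executor memory only
    df.count()                                 // an action triggers the actual caching
    df.unpersist()                             // later, release the cached blocks on the executors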

    You can also use SQL's CACHE TABLE, which is not lazily evaluated: it scans and stores the whole table in memory right away, which may also lead to OOM errors.
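
    A minimal sketch of the SQL variants, assuming the DataFrame has been registered as a temporary view named big_table (a hypothetical name):

    df.createOrReplaceTempView("big_table")

    spark.sql("CACHE TABLE big_table")        // eager: scans and caches the whole table immediately
    // or, for the same lazy behaviour as cache()/persist():
    // spark.sql("CACHE LAZY TABLE big_table")

    spark.sql("UNCACHE TABLE big_table")      // releases the cached data when it is no longer needed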

    Summary: cache(), persist(), and cacheTable() are lazily evaluated and need an action to materialize the data, whereas SQL CACHE TABLE is eager.

    See here for details!

    You can choose whichever suits your requirements!

    Hope this helps!
