What is the difference between the MEMORY_ONLY and MEMORY_AND_DISK caching levels in Spark?

自闭症患者 2020-12-25 12:37

How does the behavior of the MEMORY_ONLY and MEMORY_AND_DISK caching levels in Spark differ?

2 Answers
  • 2020-12-25 13:00

    As explained in the documentation, the persistence levels compare as follows in terms of efficiency:

    Level                Space used  CPU time  In memory  On disk  Serialized
    -------------------------------------------------------------------------
    MEMORY_ONLY          High        Low       Y          N        N
    MEMORY_ONLY_SER      Low         High      Y          N        Y
    MEMORY_AND_DISK      High        Medium    Some       Some     Some
    MEMORY_AND_DISK_SER  Low         High      Some       Some     Y
    DISK_ONLY            Low         High      N          Y        Y
    

    MEMORY_AND_DISK and MEMORY_AND_DISK_SER spill to disk if there is too much data to fit in memory.
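
    As a minimal sketch of how these levels are requested, assuming a local Scala SparkSession (the app name and toy data here are made up for illustration):

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.storage.StorageLevel

        val spark = SparkSession.builder()
          .appName("persist-demo")   // hypothetical app name
          .master("local[*]")
          .getOrCreate()

        val rdd = spark.sparkContext.parallelize(1 to 1000000).map(_ * 2)

        // MEMORY_AND_DISK_SER: serialized, memory-first, spills partitions
        // to disk when executor storage memory runs out.
        rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

        rdd.count()   // first action materializes the cache
        rdd.count()   // served from memory, or read back from disk if spilled

        rdd.unpersist()
        spark.stop()

    Note that for RDDs, cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY).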

  • 2020-12-25 13:08

    The documentation says:

    MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

    MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

    MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

    MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

    DISK_ONLY: Store the RDD partitions only on disk.

    MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.

    OFF_HEAP (experimental): Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory.

    This means that with MEMORY_ONLY, Spark will always try to keep partitions in memory. If some partitions cannot fit in memory, or are lost from RAM due to node failure, Spark recomputes them using lineage information. With MEMORY_AND_DISK, Spark always keeps computed partitions cached: it tries to hold them in RAM, but partitions that do not fit are spilled to disk.
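
    To make the contrast concrete, here is a small sketch assuming a spark-shell session where sc is the SparkContext (the input path is hypothetical):

        import org.apache.spark.storage.StorageLevel

        val lengths = sc.textFile("hdfs:///data/big.txt")   // hypothetical path
          .map(_.length)

        // MEMORY_ONLY: partitions that do not fit are simply dropped and
        // recomputed from lineage on the next action.
        lengths.persist(StorageLevel.MEMORY_ONLY)
        lengths.count()

        // An RDD's storage level cannot be changed once assigned, so
        // unpersist first before switching levels.
        lengths.unpersist()

        // MEMORY_AND_DISK: partitions that do not fit are spilled to local
        // disk and read back, trading disk I/O for recomputation.
        lengths.persist(StorageLevel.MEMORY_AND_DISK)
        lengths.count()

    Whether MEMORY_AND_DISK actually wins depends on the workload: if recomputing a partition from lineage is cheaper than reading it back from disk, MEMORY_ONLY can be faster despite the recomputation.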
