How does Spark read a large (petabyte-scale) file when the file cannot fit in Spark's main memory?

无人共我 2020-11-29 21:36

What will happen for large files in these cases?

1) Spark gets the location of the data from the NameNode. Will Spark stop at that point because the data size is too long as

2 answers
  •  孤独总比滥情好
    2020-11-29 21:49

    This is quoted directly from the Apache Spark FAQ (FAQ | Apache Spark):

    Does my data need to fit in memory to use Spark?

    No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.

    In Apache Spark, if the data does not fit into memory, Spark simply persists that data to disk.
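
As a rough illustration of the spill-to-disk idea described in the FAQ, here is a toy sketch in plain Python. It is not Spark's implementation: the function name `external_sort` and the `max_in_memory` parameter are made up for this example. It sorts a stream while keeping only a bounded buffer in memory, writes each full buffer to a temporary file, and merges the files at the end.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _spill(sorted_chunk: List[int], spill_dir: str) -> str:
    """Write one sorted chunk to a temporary file, one value per line."""
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{value}\n" for value in sorted_chunk)
    return path


def _read_run(path: str) -> Iterator[int]:
    """Stream a spilled run back from disk lazily, one value at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values: Iterable[int], max_in_memory: int) -> Iterator[int]:
    """Sort a stream while buffering at most `max_in_memory` values (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as spill_dir:
        runs: List[str] = []
        buffer: List[int] = []
        for value in values:
            buffer.append(value)
            if len(buffer) >= max_in_memory:
                # The buffer is full: sort it and spill it to disk.
                buffer.sort()
                runs.append(_spill(buffer, spill_dir))
                buffer = []
        buffer.sort()
        # Merge the spilled runs with whatever is still in memory.
        streams = [_read_run(path) for path in runs] + [iter(buffer)]
        yield from heapq.merge(*streams)
```

For example, `list(external_sort(iter([5, 3, 9, 1]), max_in_memory=2))` spills two runs of two values each and returns `[1, 3, 5, 9]`. The input can be far larger than the buffer, because only one buffer plus one value per spilled run is held in memory at a time.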

    The persist method in Apache Spark accepts a storage level that controls how the data is persisted. The available levels are:

    MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (Java and Scala), MEMORY_AND_DISK_SER (Java and Scala), DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, OFF_HEAP.

    The OFF_HEAP storage level is experimental.
