Let us assume I have a 1 TB data file. Each node in a ten-node cluster has 3 GB of memory.
I want to process the file using Spark. But how does the one terabyte fit in memory?
By default the storage level is MEMORY_ONLY, which tries to keep the data in memory. If the data cannot fit, partitions are not cached and are recomputed when needed, and in practice you can also run into out-of-memory issues.
Spark supports other storage levels such as MEMORY_AND_DISK, DISK_ONLY, etc. You can go through the Spark documentation to understand the different storage levels, and invoke the persist function on an RDD to choose one.
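As a minimal sketch (the HDFS path and the simple count job here are hypothetical), persisting with MEMORY_AND_DISK lets Spark spill partitions that do not fit in executor memory to local disk instead of failing:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PersistExample")
    val sc = new SparkContext(conf)

    // Hypothetical input path; replace with the location of your 1 TB file.
    val lines = sc.textFile("hdfs:///data/one-terabyte-file.txt")

    // MEMORY_AND_DISK keeps as many partitions in memory as fit
    // and spills the rest to local disk instead of dropping them.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    // Trigger an action. Each executor only needs to hold a few
    // partitions at a time, not the entire file.
    println(lines.count())

    sc.stop()
  }
}
```

Also note that a plain pass over the file (such as count or map) does not require the whole 1 TB to be in memory at once; Spark processes the data partition by partition, so persisting is only needed if you want to reuse the RDD across multiple actions.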