I have a cluster and I execute `wholeTextFiles`, which should pull about a million text files that sum to approximately 10GB total.
I have one NameNode.
To summarize my recommendations from the comments:
By default, Spark starts 2 executors (`--num-executors`) with 1 thread each (`--executor-cores`) and 512m of RAM (`--executor-memory`), giving you only 2 threads with 512MB of RAM each, which is really small for real-world tasks. So my recommendation is:
```
--num-executors 4 --executor-memory 12g --executor-cores 4
```
which would give you more parallelism: 16 threads in this particular case, which means 16 tasks running in parallel.
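For example, when launching the shell these flags would go on the command line (a sketch; the YARN master is an assumption about your setup):

```
spark-shell --master yarn \
  --num-executors 4 \
  --executor-memory 12g \
  --executor-cores 4
```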
My second recommendation is to use `sc.wholeTextFiles` to read the files once and then dump them into a compressed SequenceFile (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration.
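A minimal sketch of that pack-and-reread step, assuming a spark-shell session (so `sc` already exists) and hypothetical HDFS paths:

```scala
import org.apache.hadoop.io.compress.SnappyCodec

// Read the small files once: an RDD of (path, contents) pairs.
val files = sc.wholeTextFiles("hdfs:///data/small-files")

// Pack them into Snappy-compressed SequenceFiles (Spark writes
// block-compressed output when a codec is supplied).
files.saveAsSequenceFile("hdfs:///data/packed", Some(classOf[SnappyCodec]))

// Later iterations read the packed copy instead of a million files.
val packed = sc.sequenceFile[String, String]("hdfs:///data/packed")
```

SequenceFiles stay splittable even when compressed, so subsequent jobs can still spread the data across all 16 threads while paying the file-open overhead once per packed file rather than once per original file.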