Spark: sc.wholeTextFiles takes a long time to execute

孤独总比滥情好 2020-12-17 06:07

I have a cluster and I execute wholeTextFiles, which should pull about a million text files that sum up to approximately 10 GB in total. I have one NameNo…

1 Answer
  • 2020-12-17 06:37

    To summarize my recommendations from the comments:

    1. HDFS is not a good fit for storing many small files. First of all, the NameNode stores all metadata in memory, so the number of files and blocks you can have is limited (~100M blocks is the maximum for a typical server). Next, each time you read a file you first query the NameNode for the block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really significant.
    2. Default settings should always be reviewed. By default, Spark starts on YARN with 2 executors (--num-executors), each with 1 thread (--executor-cores) and 512 MB of RAM (--executor-memory), giving you only 2 threads with 512 MB each, which is really small for real-world tasks (a programmatic sketch of overriding these defaults follows below).
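
    As a rough illustration, the same settings can be set programmatically on the SparkConf before the SparkContext is created; this is a minimal sketch, equivalent to passing the spark-submit flags above, and the app name is a hypothetical placeholder:

        import org.apache.spark.{SparkConf, SparkContext}

        // Executor settings must be fixed before the SparkContext starts;
        // these mirror the spark-submit flags mentioned above
        val conf = new SparkConf()
          .setAppName("whole-text-files-job")    // hypothetical app name
          .set("spark.executor.instances", "4")  // --num-executors 4
          .set("spark.executor.cores", "4")      // --executor-cores 4
          .set("spark.executor.memory", "12g")   // --executor-memory 12g
        val sc = new SparkContext(conf)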

    So my recommendations are:

    1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism: 16 threads in this particular case, meaning 16 tasks running in parallel.
    2. Use sc.wholeTextFiles to read the files once and then dump them into a compressed SequenceFile (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration (see the sketch below).
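
    Here is a minimal sketch of that second step in Spark's Scala API; the HDFS paths and app name are hypothetical placeholders:

        import org.apache.hadoop.io.compress.SnappyCodec
        import org.apache.spark.{SparkConf, SparkContext}

        val sc = new SparkContext(new SparkConf().setAppName("pack-small-files"))

        // Request block-level compression inside the SequenceFile
        // rather than per-record compression
        sc.hadoopConfiguration.set("mapred.output.compression.type", "BLOCK")

        // One expensive pass over the small files: RDD of (path, content)
        val files = sc.wholeTextFiles("hdfs:///data/small-files")

        // Pack everything into a few large Snappy-compressed SequenceFiles,
        // keyed by the original path so each document stays identifiable
        files.saveAsSequenceFile("hdfs:///data/packed", Some(classOf[SnappyCodec]))

        // Later iterations read the packed data instead of a million files
        val packed = sc.sequenceFile[String, String]("hdfs:///data/packed")

    Reading the packed SequenceFiles avoids one NameNode round-trip per file, which is where most of the wholeTextFiles time goes with a million small files.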