Why does a job fail with “No space left on device”, but df says otherwise?

無奈伤痛 2020-12-04 15:51

When performing a shuffle my Spark job fails and says "no space left on device", but when I run df -h it says I have free space left! Why does this happen?

8 Answers
  •  误落风尘
    2020-12-04 16:14

    You also need to monitor df -i, which shows how many inodes are in use.
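    For example, a filesystem can run out of inodes while df -h still reports free block space. A hypothetical output (illustrative, not from the original post):

        $ df -h /data
        Filesystem      Size  Used Avail Use% Mounted on
        /dev/sda1       500G  200G  300G  40% /data

        $ df -i /data
        Filesystem       Inodes    IUsed IFree IUse% Mounted on
        /dev/sda1      32768000 32768000     0  100% /data

    Here every file creation fails with "No space left on device" even though 300G of block space remains free, because no inodes are left.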

    On each machine, we create M * R temporary files for shuffle, where M = the number of map tasks and R = the number of reduce tasks.
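    To get a sense of scale: with a hypothetical job of M = 1,000 map tasks and R = 1,000 reduce tasks, that is 1,000,000 temporary shuffle files per machine, which can exhaust a filesystem's inodes long before its block space fills up.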

    https://spark-project.atlassian.net/browse/SPARK-751

    If you do indeed see that disks are running out of inodes, you can fix the problem in one of these ways:

    • Decrease the number of partitions (see coalesce with shuffle = false); there is a sketch after this list.
    • Drop the number of files to O(R) by “consolidating files”. As different file systems behave differently, it’s recommended that you read up on spark.shuffle.consolidateFiles and see https://spark-project.atlassian.net/secure/attachment/10600/Consolidating%20Shuffle%20Files%20in%20Spark.pdf.
    • Sometimes you may simply find that you need your DevOps team to increase the number of inodes the filesystem supports.
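    A minimal Scala sketch of the first two options, assuming a pre-1.6 Spark where spark.shuffle.consolidateFiles still exists (see the edit below); the app name, RDD, and partition counts are hypothetical:

        import org.apache.spark.{SparkConf, SparkContext}

        // spark.shuffle.consolidateFiles applies only to Spark < 1.6;
        // the setting was removed later (see the edit below).
        val conf = new SparkConf()
          .setAppName("ShuffleFileExample") // hypothetical app name
          .set("spark.shuffle.consolidateFiles", "true") // drops shuffle file count to O(R)
        val sc = new SparkContext(conf)

        // Hypothetical RDD with far more partitions (map tasks) than needed.
        val wide = sc.parallelize(1 to 1000000, numSlices = 4000)

        // coalesce with shuffle = false (the default) merges partitions without
        // introducing a new shuffle, so M shrinks and so does M * R.
        val narrow = wide.coalesce(400, shuffle = false)

        println(narrow.partitions.length) // 400

    Keeping shuffle = false matters here: passing shuffle = true would repartition through a new shuffle, briefly creating the very temporary files you are trying to avoid.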

    EDIT

    Shuffle file consolidation was removed from Spark in version 1.6: https://issues.apache.org/jira/browse/SPARK-9808
