Spark: long delay between jobs

Backend · Unresolved · 2 answers · 1948 views
星月不相逢 2020-12-03 03:09

So we are running a Spark job that extracts data, does some expensive data conversion, and writes to several different files. Everything is running fine, but I'm getting random long delays between jobs.

2 Answers
  离开以前 2020-12-03 03:37

    I/O operations often come with significant overhead that occurs on the master node. Since this work isn't parallelized, it can take quite a bit of time, and because it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done on the master node:

    • Spark writes to temporary S3 directories, then moves the files into place using the master node
    • Reading of text files often occurs on the master node
    • When writing Parquet files, the master node scans all the files post-write to check the schema

    These issues can be solved by tweaking YARN settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
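
    As a starting point, here is a minimal sketch of Spark settings that cut down this kind of driver-side I/O when writing Parquet to S3. The property names are standard Spark/Hadoop options, but the session setup and the app name are illustrative assumptions, not a drop-in fix for your job:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-to-s3") // hypothetical app name
      // Commit algorithm v2 lets tasks move their output directly into the
      // final location, avoiding the slow single-threaded rename of the
      // temporary directory that otherwise happens on the master node.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // Skip the Parquet _metadata/_common_metadata summary files; producing
      // them means re-reading every output file's footer after the write.
      .config("spark.hadoop.parquet.enable.summary-metadata", "false")
      // Don't merge schemas across part files when the Parquet data is read back.
      .config("spark.sql.parquet.mergeSchema", "false")
      .getOrCreate()
    ```

    Note that the v2 committer trades strictness for speed (partial output from failed tasks can be left behind), and on S3 a "rename" is really a copy, so an object-store-aware committer, or writing to HDFS first and copying to S3 afterwards, avoids the rename cost entirely.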

    Discussion of write I/O overhead with Parquet and S3

    Discussion of read I/O overhead: "S3 is not a filesystem"
