getExecutorMemoryStatus().size() not outputting correct num of executors

Submitted by 只谈情不闲聊 on 2019-12-01 13:47:19

Short fix: allow some time (e.g. add a sleep) before reading defaultParallelism or _jsc.sc().getExecutorMemoryStatus() if you use either at the very beginning of the application's execution.
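A more robust alternative to a fixed sleep is to poll until the expected number of executors has registered. A minimal sketch of such a helper (the name `wait_for` and the timeout values are my own, not part of any Spark API):

```python
import time


def wait_for(condition, timeout=30.0, poll_interval=0.5):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    # One last check at the deadline before giving up.
    return condition()


# With a live SparkContext `sc`, one could then wait for the expected
# executor count instead of sleeping for a fixed 15 seconds, e.g.:
#   wait_for(lambda: sc._jsc.sc().getExecutorMemoryStatus().size() >= 13)
```

This avoids both waiting longer than necessary and waiting too little on a slow cluster.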

Explanation: there seems to be a short period at startup during which only one executor exists (I believe this single executor is the driver, which in some contexts is counted as an executor). That's why calling sc._jsc.sc().getExecutorMemoryStatus() at the top of the main function yielded the wrong number for me. The same happened with defaultParallelism(1).

My suspicion is that the driver starts working, using itself as a worker, before all the workers have connected to it. This is consistent with the fact that submitting the code below to spark-submit with --total-executor-cores 12

import time

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("app_name")
sc = SparkContext(conf=conf)
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger("dbg_et")

log.warn('defaultParallelism={0}, and size of executorMemoryStatus={1}'
         .format(sc.defaultParallelism,
                 sc._jsc.sc().getExecutorMemoryStatus().size()))
time.sleep(15)
log.warn('After 15 seconds: defaultParallelism={0}, and size of executorMemoryStatus={1}'
         .format(sc.defaultParallelism,
                 sc._jsc.sc().getExecutorMemoryStatus().size()))
# sc.defaultParallelism stands in for the original code's private helper
# (spark_context_holder.getParallelismAlternative()) so the snippet is
# self-contained; it matches the 36 partitions (12 * 3) seen in the output.
rdd_collected = (sc.parallelize([1, 2, 3, 4, 5] * 200,
                                sc.defaultParallelism * 3)
                 .map(lambda x: (x, x * x) * 2)  # 4-tuple (x, x*x, x, x*x)
                 .map(lambda x: x[2] + x[1]))    # x + x*x
log.warn('Made rdd with {0} partitioned. About to collect.'
         .format(rdd_collected.getNumPartitions()))
rdd_collected.collect()
log.warn('And after rdd operations: defaultParallelism={0}, and size of executorMemoryStatus={1}'
         .format(sc.defaultParallelism,
                 sc._jsc.sc().getExecutorMemoryStatus().size()))

gave me the following output

> tail -n 4 slurm-<job number>.out
18/09/26 13:23:52 WARN dbg_et: defaultParallelism=2, and size of executorMemoryStatus=1
18/09/26 13:24:07 WARN dbg_et: After 15 seconds: defaultParallelism=12, and size of executorMemoryStatus=13
18/09/26 13:24:07 WARN dbg_et: Made rdd with 36 partitioned. About to collect.
18/09/26 13:24:11 WARN dbg_et: And after rdd operations: defaultParallelism=12, and size of executorMemoryStatus=13

Checking the time at which the worker directories were created, I saw that it was just after the correct values for both defaultParallelism and getExecutorMemoryStatus().size() were recorded(2). Importantly, this was quite a long time (~10 seconds) after the wrong values for these two parameters were recorded (compare the timestamp of the "defaultParallelism=2" line above with the directories' creation times below):

 > ls -l --time-style=full-iso spark/worker_dir/app-20180926132351-0000/
 <permission user blah> 2018-09-26 13:24:08.909960000 +0300 0/
 <permission user blah> 2018-09-26 13:24:08.665098000 +0300 1/
 <permission user blah> 2018-09-26 13:24:08.912871000 +0300 10/
 <permission user blah> 2018-09-26 13:24:08.769355000 +0300 11/
 <permission user blah> 2018-09-26 13:24:08.931957000 +0300 2/
 <permission user blah> 2018-09-26 13:24:09.019684000 +0300 3/
 <permission user blah> 2018-09-26 13:24:09.138645000 +0300 4/
 <permission user blah> 2018-09-26 13:24:08.757164000 +0300 5/
 <permission user blah> 2018-09-26 13:24:08.996918000 +0300 6/
 <permission user blah> 2018-09-26 13:24:08.640369000 +0300 7/
 <permission user blah> 2018-09-26 13:24:08.846769000 +0300 8/
 <permission user blah> 2018-09-26 13:24:09.152162000 +0300 9/

(1) Before starting to use getExecutorMemoryStatus() I tried using defaultParallelism, as you should, but it kept giving me the number 2. Now I understand this happens for the same reason: on a standalone cluster, if the driver sees only 1 executor then defaultParallelism = 2, as can be seen in the documentation for spark.default.parallelism.
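Per Spark's configuration documentation, for distributed shuffle operations on a standalone cluster the default parallelism is the total number of cores on all executor nodes, or 2, whichever is larger. A minimal sketch of that rule (the function name is mine, for illustration only):

```python
def standalone_default_parallelism(total_executor_cores):
    # Spark's configuration docs for spark.default.parallelism: "total
    # number of cores on all executor nodes or 2, whichever is larger".
    return max(total_executor_cores, 2)


# This matches the log output above: with no executor cores visible at
# startup the driver falls back to 2, and with 12 cores registered it is 12.
```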

(2) I'm not sure why the values become correct BEFORE the directories are created - but I'm assuming the executors' startup order has them connecting to the driver before creating their directories.
