Question
I am running a Spark job on a 2-node YARN cluster. My dataset is not large (< 100 MB), just for testing, and the worker is getting killed because it is asking for too much virtual memory. The amounts here are absurd: 2 GB of 11 GB physical memory used, but 300 GB of virtual memory used.
16/02/12 05:49:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.1 (TID 22, ip-172-31-6-141.ec2.internal): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1455246675722_0023_01_000003 on host: ip-172-31-6-141.ec2.internal. Exit status: 143. Diagnostics: Container [pid=23206,containerID=container_1455246675722_0023_01_000003] is running beyond virtual memory limits. Current usage: 2.1 GB of 11 GB physical memory used; 305.3 GB of 23.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1455246675722_0023_01_000003 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23292 23213 23292 23206 (python) 15 3 101298176 5514 python -m pyspark.daemon
|- 23206 1659 23206 23206 (bash) 0 0 11431936 352 /bin/bash -c /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms10240m -Xmx10240m -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/tmp '-Dspark.driver.port=37386' -Dspark.yarn.app.container.log.dir=/mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.31.0.92:37386 --executor-id 2 --hostname ip-172-31-6-141.ec2.internal --cores 8 --app-id application_1455246675722_0023 --user-class-path file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/app.jar 1> /mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003/stdout 2> /mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003/stderr
|- 23341 23292 23292 23206 (python) 87 8 39464374272 23281 python -m pyspark.daemon
|- 23350 23292 23292 23206 (python) 86 7 39463976960 24680 python -m pyspark.daemon
|- 23329 23292 23292 23206 (python) 90 6 39464521728 23281 python -m pyspark.daemon
|- 23213 23206 23206 23206 (java) 1168 61 11967115264 359820 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms10240m -Xmx10240m -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/tmp -Dspark.driver.port=37386 -Dspark.yarn.app.container.log.dir=/mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.31.0.92:37386 --executor-id 2 --hostname ip-172-31-6-141.ec2.internal --cores 8 --app-id application_1455246675722_0023 --user-class-path file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/app.jar
|- 23347 23292 23292 23206 (python) 87 10 39464783872 23393 python -m pyspark.daemon
|- 23335 23292 23292 23206 (python) 83 9 39464112128 23216 python -m pyspark.daemon
|- 23338 23292 23292 23206 (python) 81 9 39463714816 24614 python -m pyspark.daemon
|- 23332 23292 23292 23206 (python) 86 6 39464374272 24812 python -m pyspark.daemon
|- 23344 23292 23292 23206 (python) 85 30 39464374272 23281 python -m pyspark.daemon
Container killed on request. Exit code is 143
Does anyone know why this might be happening? I've tried modifying various YARN and Spark configurations, but I know something is deeply wrong for it to be asking for this much virtual memory.
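For reference, my understanding is that the 23.1 GB virtual memory cap in the log is just the container's 11 GB physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1 (11 GB × 2.1 = 23.1 GB). These are the yarn-site.xml knobs I believe govern that check; the values below are only examples of what one might relax, not settings I'm recommending:

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>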
Answer 1:
The command I was running was
spark-submit --executor-cores 8 ...
It turns out the executor-cores flag doesn't do what I thought it did. It makes 8 copies of the pyspark.daemon process, i.e. 8 copies of the worker process to run jobs. Each process was using about 38 GB of virtual memory, which is unnecessarily large, but 8 × 38 ≈ 300, so that explains the total.
It's actually a very poorly named flag. If I set executor-cores to 1, it makes one daemon, but the daemon will use multiple cores, as seen via htop.
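For anyone who wants the concrete change, this is a sketch of the resubmission; only the flag value differs, and the rest of the command is elided just as in the original above:

spark-submit --executor-cores 1 ...

With this, only one pyspark.daemon shows up per executor, which should keep the container's virtual memory footprint well under YARN's limit.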
Source: https://stackoverflow.com/questions/35355823/spark-worker-asking-for-absurd-amounts-of-virtual-memory