Enable GPU resources (CUDA) on DC/OS

Posted by 99封情书 on 2019-12-25 14:12:37

Question


I have a cluster with GPU nodes (NVIDIA) and have deployed DC/OS 1.8. I'd like to schedule jobs (batch and Spark) on the GPU nodes using GPU isolation. DC/OS 1.8 is based on Mesos 1.0.1, which supports GPU isolation.


Answer 1:


Unfortunately, DC/OS doesn't officially support GPUs in 1.8 (experimental support for GPUs will be coming in the next release as mentioned here: https://github.com/dcos/dcos/pull/766 ).

In this next release, only Marathon will officially be able to launch GPU services (Metronome (i.e. batch jobs) will not).

Regarding Spark, the Spark version bundled with Universe probably doesn't have GPU support for Mesos built in yet. Spark itself has it coming soon though: https://github.com/apache/spark/pull/14644




Answer 2:


To enable GPU resources in a DC/OS cluster, follow these steps:

  1. Configure the Mesos agents on the GPU nodes:
    1.1. Stop dcos-mesos-slave.service:

    systemctl stop dcos-mesos-slave.service

    1.2. Add the following parameters to the /var/lib/dcos/mesos-slave-common file:

    # A comma-separated list of GPU ids, as determined by running nvidia-smi on the host where the agent is to be launched
    MESOS_NVIDIA_GPU_DEVICES="0,1"

    # The value of the gpus resource must match the number of ids above
    MESOS_RESOURCES= [ {"name":"ports","type":"RANGES","ranges": {"range": [{"begin": 1025, "end": 2180},{"begin": 2182, "end": 3887},{"begin": 3889, "end": 5049},{"begin": 5052, "end": 8079},{"begin": 8082, "end": 8180},{"begin": 8182, "end": 32000}]}} ,{"name": "gpus","type": "SCALAR","scalar": {"value": 2}}]

    MESOS_ISOLATION=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,docker/volume,cgroups/devices,gpu/nvidia

    1.3. Start dcos-mesos-slave.service:

    systemctl start dcos-mesos-slave.service

  2. Enable the GPU_RESOURCES capability in the Mesos frameworks:

    2.1. The Marathon framework should be launched with the option --enable_features "gpu_resources"

    2.2. The Aurora scheduler should be launched with the option -allow_gpu_resource
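
Step 1.2 requires the gpus scalar in MESOS_RESOURCES to match the number of ids in MESOS_NVIDIA_GPU_DEVICES. A minimal Python sketch of that check is below. The helper names (parse_env_file, gpu_config_is_consistent) and the shortened sample file are hypothetical and only illustrate the rule; this is not a DC/OS or Mesos tool.

```python
import json

# Hypothetical, shortened sample in the style of /var/lib/dcos/mesos-slave-common.
SAMPLE = '''
# GPU ids reported by nvidia-smi
MESOS_NVIDIA_GPU_DEVICES="0,1"
MESOS_RESOURCES= [{"name": "ports", "type": "RANGES", "ranges": {"range": [{"begin": 1025, "end": 2180}]}}, {"name": "gpus", "type": "SCALAR", "scalar": {"value": 2}}]
MESOS_ISOLATION=cgroups/devices,gpu/nvidia
'''


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so JSON values stay intact.
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and any enclosing quotes.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def gpu_config_is_consistent(settings):
    """Return True if the gpus scalar equals the number of listed GPU ids."""
    raw_ids = settings.get("MESOS_NVIDIA_GPU_DEVICES", "")
    ids = [d for d in raw_ids.split(",") if d.strip()]
    resources = json.loads(settings.get("MESOS_RESOURCES", "[]"))
    gpus = [r["scalar"]["value"] for r in resources if r.get("name") == "gpus"]
    return len(gpus) == 1 and gpus[0] == len(ids)


if __name__ == "__main__":
    print(gpu_config_is_consistent(parse_env_file(SAMPLE)))
```

On a real agent, the same check could be run against the contents of /var/lib/dcos/mesos-slave-common before restarting dcos-mesos-slave.service.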

Note:

Any host running a Mesos agent with Nvidia GPU support MUST have a valid Nvidia kernel driver installed. It is also highly recommended to install the corresponding user-level libraries and tools available as part of the Nvidia CUDA toolkit. Many jobs that use Nvidia GPUs rely on CUDA and not including it will severely limit the type of GPU-aware jobs you can run on Mesos.
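
Since nvidia-smi ships with the NVIDIA driver, listing devices with `nvidia-smi -L` is one way to confirm the driver is working and to collect the ids for MESOS_NVIDIA_GPU_DEVICES. The Python sketch below parses that listing. The sample text, its placeholder UUIDs, and the helper names are assumptions for illustration, and the exact listing format may differ between driver versions.

```python
import re

# Illustrative listing in the style of `nvidia-smi -L`; the UUIDs are placeholders.
SAMPLE_LISTING = """\
GPU 0: Tesla K80 (UUID: GPU-00000000-0000-0000-0000-000000000000)
GPU 1: Tesla K80 (UUID: GPU-11111111-1111-1111-1111-111111111111)
"""


def gpu_ids_from_listing(listing):
    """Extract the device indices from lines that start with 'GPU <n>:'."""
    pattern = re.compile(r"^GPU (\d+):", flags=re.MULTILINE)
    return [match.group(1) for match in pattern.finditer(listing)]


def mesos_gpu_settings(listing):
    """Return the MESOS_NVIDIA_GPU_DEVICES line and the matching gpus count."""
    ids = gpu_ids_from_listing(listing)
    return 'MESOS_NVIDIA_GPU_DEVICES="{}"'.format(",".join(ids)), len(ids)


if __name__ == "__main__":
    line, count = mesos_gpu_settings(SAMPLE_LISTING)
    print(line)
    print("gpus scalar value:", count)
```

On a GPU host, the listing could be captured with `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True).stdout` instead of the sample string. An empty result would suggest that the driver is missing or that no devices are visible.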



Source: https://stackoverflow.com/questions/40346321/enable-gpu-resources-cuda-on-dc-os
