LinkedIn TonY

懵懂的女人, submitted 2021-02-13 16:43:50

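Background: TensorFlowOnYARN was open-sourced early on and its author has since stopped maintaining it; TonY is now the system the industry recommends.

Environment: CentOS 7.0 or later, physical machine, Python 2.7.5 virtual environment (default), tensorflow 1.13.1.

Download: https://github.com/linkedin/TonY, or git clone https://github.com/linkedin/TonY.git.

Installing components: apt-get update, apt-get install wget, apt-get install vim, apt-get install git; then upload the JDK and edit .bashrc to configure the Java environment. (apt-get applies to Debian/Ubuntu hosts; on CentOS the equivalent is yum install.)

Build: ./gradlew build or ./gradlew build -x test. The former builds and runs the tests; the latter builds without running them. The output files are placed under ./tony-cli/build/libs/.

Build output: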

root@b9683a1b9302:~/TonY/tony-cli/build/libs# ll -h
total 29M
drwxr-xr-x 2 root root 4.0K Nov 4 03:17 ./
drwxr-xr-x 9 39040 staff 4.0K Nov 4 03:01 ../
-rw-r--r-- 1 root root 29M Nov 4 03:17 tony-cli-0.3.23-all.jar
-rw-r--r-- 1 root root 12K Nov 4 03:01 tony-cli-0.3.23.jar
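
Only the "-all" jar from the listing above is used later when submitting a job. As a minimal sketch, a build script could locate it like this; the function name find_all_jar is hypothetical and not part of TonY:

```python
from pathlib import Path


def find_all_jar(libs_dir):
    """Return the most recently modified tony-cli-*-all.jar in libs_dir, or None."""
    candidates = sorted(
        Path(libs_dir).glob("tony-cli-*-all.jar"),
        key=lambda p: p.stat().st_mtime,  # newest build last
    )
    return candidates[-1] if candidates else None
```

Calling find_all_jar("tony-cli/build/libs") from the repository root would return the path of the fat jar if the build succeeded, and None otherwise.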

 

Installing Python 3.7.0 (optional):

  1. wget https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tgz to download the source package;
  2. tar -xvf Python-3.7.0.tgz to extract it, then cd Python-3.7.0 to enter the source root;
  3. ./configure --enable-optimizations to generate the Makefile;
  4. make altinstall to compile and install; the python3.7 binary ends up under /usr/local/bin.

 

Building the Python virtual environment:

  1. wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz;
  2. tar -xvf virtualenv-16.0.0.tar.gz to extract the package;
  3. python virtualenv-16.0.0/virtualenv.py venv to create the virtual environment;
  4. run ". venv/bin/activate" to enter the virtual environment;
  5. pip install tensorflow==1.13.1 to install TensorFlow inside the virtual environment;
  6. zip -r venv.zip venv to compress the virtual environment; the resulting file is about 268MB.
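
The packaging step (step 6) can also be scripted. The following is a rough sketch using the standard library instead of the zip command; zip_venv is a hypothetical helper name, and symlink handling may differ from zip -r (for example a lib64 symlink inside the environment), so treat it as an approximation rather than a drop-in replacement:

```python
import shutil
from pathlib import Path


def zip_venv(venv_dir, out_base="venv"):
    """Create <out_base>.zip with the environment stored under its own
    top-level directory name (e.g. venv/bin/python), similar to `zip -r`."""
    venv_path = Path(venv_dir).resolve()
    return shutil.make_archive(
        base_name=str(out_base),          # ".zip" is appended automatically
        format="zip",
        root_dir=str(venv_path.parent),   # archive paths are relative to this
        base_dir=venv_path.name,          # keeps the "venv/" prefix in the archive
    )
```

Keeping the venv/ prefix matters because the job submission later refers to the interpreter as venv/bin/python.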

 

Packages downloaded by pip install tensorflow==1.13.1:

  1. Downloading https://files.pythonhosted.org/packages/d2/ea/ab2c8c0e81bd051cc1180b104c75a865ab0fc66c89be992c4b20bbf6d624/tensorflow-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl (92.5MB)
  2. Downloading https://files.pythonhosted.org/packages/3b/72/e6e483e2db953c11efa44ee21c5fdb6505c4dffa447b4263ca8af6676b62/absl-py-0.8.1.tar.gz (103kB)
  3. Downloading https://files.pythonhosted.org/packages/88/ec/f598b633c3d5ffe267aaada57d961c94fdfa183c5c3ebda2b6d151943db6/backports.weakref-1.0.post1-py2.py3-none-any.whl
  4. Downloading https://files.pythonhosted.org/packages/89/ac/48dd71c2bdc8d31e367f9b72f25ccb3b89bc6b9d664fee21f9a8efa5714d/tensorboard-1.13.1-py2-none-any.whl (3.2MB)
  5. Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
  6. Downloading https://files.pythonhosted.org/packages/d7/b1/3367ea1f372957f97a6752ec725b87886e12af1415216feec9067e31df70/numpy-1.16.5-cp27-cp27mu-manylinux1_x86_64.whl (17.0MB)
  7. Downloading https://files.pythonhosted.org/packages/05/d2/f94e68be6b17f46d2c353564da56e6fb89ef09faeeff3313a046cb810ca9/mock-3.0.5-py2.py3-none-any.whl
  8. Downloading https://files.pythonhosted.org/packages/21/56/4bcec5a8d9503a87e58e814c4e32ac2b32c37c685672c30bc8c54c6e478a/Keras_Applications-1.0.8.tar.gz (289kB)
  9. Downloading https://files.pythonhosted.org/packages/bb/48/13f49fc3fa0fdf916aa1419013bb8f2ad09674c275b4046d5ee669a46873/tensorflow_estimator-1.13.0-py2.py3-none-any.whl (367kB)
  10. Downloading https://files.pythonhosted.org/packages/59/54/4441f0b3c44e38b1377d31c137cdaa6dfad225f5ee79612ed87131427baf/grpcio-1.24.3-cp27-cp27mu-manylinux2010_x86_64.whl (2.2MB)
  11. Downloading https://files.pythonhosted.org/packages/d1/4f/950dfae467b384fc96bc6469de25d832534f6b4441033c39f914efd13418/astor-0.8.0-py2.py3-none-any.whl
  12. Downloading https://files.pythonhosted.org/packages/28/6a/8c1f62c37212d9fc441a7e26736df51ce6f0e38455816445471f10da4f0a/Keras_Preprocessing-1.1.0-py2.py3-none-any.whl (41kB)
  13. Downloading https://files.pythonhosted.org/packages/1f/04/4e36c33f8eb5c5b6c622a1f4859352a6acca7ab387257d4b3c191d23ec1d/gast-0.3.2.tar.gz
  14. Downloading https://files.pythonhosted.org/packages/c5/db/e56e6b4bbac7c4a06de1c50de6fe1ef3810018ae11732a50f15f62c7d050/enum34-1.1.6-py2-none-any.whl
  15. Downloading https://files.pythonhosted.org/packages/c5/49/ffa7ab9c52ec56b535cffec3bc844254c073888e6d4aeee464671ac97480/protobuf-3.10.0-cp27-cp27mu-manylinux1_x86_64.whl (1.3MB)
  16. Downloading https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
  17. Downloading https://files.pythonhosted.org/packages/ce/42/3aeda98f96e85fd26180534d36570e4d18108d62ae36f87694b476b83d6f/Werkzeug-0.16.0-py2.py3-none-any.whl (327kB)
  18. Downloading https://files.pythonhosted.org/packages/d8/a6/f46ae3f1da0cd4361c344888f59ec2f5785e69c872e175a748ef6071cdb5/futures-3.3.0-py2-none-any.whl
  19. Downloading https://files.pythonhosted.org/packages/c0/4e/fd492e91abdc2d2fcb70ef453064d980688762079397f779758e055f6575/Markdown-3.1.1-py2.py3-none-any.whl (87kB)
  20. Downloading https://files.pythonhosted.org/packages/69/cb/f5be453359271714c01b9bd06126eaf2e368f1fddfff30818754b5ac2328/funcsigs-1.0.2-py2.py3-none-any.whl
  21. Downloading https://files.pythonhosted.org/packages/12/90/3216b8f6d69905a320352a9ca6802a8e39fdb1cd93133c3d4163db8d5f19/h5py-2.10.0-cp27-cp27mu-manylinux1_x86_64.whl (2.8MB)

 

Hadoop setup: see the Hadoop cluster setup guide

 

Project layout:

MyJob/
    myjob.sh (launch script)
    src/ (project code)
        models/
            mnist_distributed.py
    tony.xml (project configuration)
    tony-cli-0.3.23-all.jar (TonY jar)
    venv.zip (Python virtual environment)

 

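myjob.sh script:

#!/bin/sh

# --python_venv         path to the zipped Python virtual environment
# --src_dir             path to the project code
# --executes            main file of the project code
# --task_params         arguments passed to the main program
# --conf_file           path to the tony.xml configuration file
# --python_binary_path  path to python inside the virtual environment
# The comments live up here because text after a trailing "\" breaks the line continuation.
java -cp "$(hadoop classpath)":/home/homework/MyJob/tony-cli-0.3.23-all.jar com.linkedin.tony.cli.ClusterSubmitter \
    --python_venv=/home/homework/MyJob/venv.zip \
    --src_dir=/home/homework/MyJob/src/models \
    --executes=mnist_distributed.py \
    --task_params="--steps 1000 --data_dir /tmp/data --working_dir /tmp/model" \
    --conf_file=/home/homework/MyJob/tony.xml \
    --python_binary_path=venv/bin/python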
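
If the submission is driven from Python rather than a shell script, the same argument list can be assembled programmatically. This is only a sketch: build_submit_command and its parameters are hypothetical names, the flags are the ones used in myjob.sh, and the classpath string is assumed to come from the output of "hadoop classpath":

```python
def build_submit_command(job_dir, hadoop_classpath, steps=1000):
    """Return the ClusterSubmitter command from myjob.sh as an argument list."""
    jar = f"{job_dir}/tony-cli-0.3.23-all.jar"
    task_params = f"--steps {steps} --data_dir /tmp/data --working_dir /tmp/model"
    return [
        "java",
        "-cp", f"{hadoop_classpath}:{jar}",
        "com.linkedin.tony.cli.ClusterSubmitter",
        f"--python_venv={job_dir}/venv.zip",
        f"--src_dir={job_dir}/src/models",
        "--executes=mnist_distributed.py",
        f"--task_params={task_params}",        # one argument, no shell quoting needed
        f"--conf_file={job_dir}/tony.xml",
        "--python_binary_path=venv/bin/python",
    ]
```

Passing the returned list to subprocess.run(cmd, check=True) would launch the job; because the list bypasses the shell, the quoting and line-continuation pitfalls of the script do not arise.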

 

tony.xml configuration:

<configuration>
    <property>
        <name>tony.worker.instances</name>
        <value>2</value>
        <description>Number of workers</description>
    </property>
    <property>
        <name>tony.worker.memory</name>
        <value>4g</value>
        <description>Memory per worker</description>
    </property>
    <property>
        <name>tony.ps.instances</name>
        <value>1</value>
        <description>Number of parameter servers (ps)</description>
    </property>
    <property>
        <name>tony.ps.memory</name>
        <value>3g</value>
        <description>Memory per ps</description>
    </property>
    <property>
        <name>tony.application.security.enabled</name>
        <value>false</value>
        <description>Whether to obtain tokens from the cluster and between the client and the AM</description>
    </property>
</configuration>
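
Before submitting, it can be handy to read the configuration back and check the values. A minimal sketch with the standard library follows; read_tony_conf is a hypothetical helper, not something TonY provides:

```python
import xml.etree.ElementTree as ET


def read_tony_conf(xml_text):
    """Parse Hadoop-style <configuration> XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }
```

With the file above, read_tony_conf(open("tony.xml").read()) would map tony.worker.instances to the string "2" and tony.ps.memory to "3g"; values stay strings, as in the XML.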

 

venv.zip contents (25967 files):

[homework@localhost]$ unzip -Z1 venv.zip | head -n 10
venv/
venv/lib/
venv/lib/python2.7/
venv/lib/python2.7/sre_compile.pyc
venv/lib/python2.7/no-global-site-packages.txt
venv/lib/python2.7/_abcoll.pyc
venv/lib/python2.7/copy_reg.py
venv/lib/python2.7/distutils/
venv/lib/python2.7/distutils/__init__.py
venv/lib/python2.7/distutils/__init__.pyc
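
Since --python_binary_path=venv/bin/python is given relative to the archive root, a quick check that the archive really contains that entry can catch a mis-packaged environment early. A small sketch, with archive_has_member as a hypothetical helper name:

```python
import zipfile


def archive_has_member(zip_path, member="venv/bin/python"):
    """Return True if the zip archive contains an entry with exactly this path."""
    with zipfile.ZipFile(zip_path) as zf:
        return member in zf.namelist()
```

If archive_has_member("venv.zip") is False, the environment was probably zipped from inside the venv directory instead of from its parent.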

 

mnist_distributed.py code (excerpt):

import json
import os

import tensorflow as tf

cluster_spec_str = os.environ["CLUSTER_SPEC"]   # read the environment variable
cluster_spec = json.loads(cluster_spec_str)    # parse it as JSON
ps_hosts = cluster_spec['ps']    # addresses of the ps tasks
worker_hosts = cluster_spec['worker']    # addresses of the worker tasks

# Create the cluster from the parameter server and worker hosts
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# Create and start a server for the local task
job_name = os.environ["JOB_NAME"]
task_index = int(os.environ["TASK_INDEX"])
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
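
To try the environment parsing without TensorFlow or a cluster, the same logic can be exercised against a plain dictionary. This is a sketch under assumptions: parse_task_env is a hypothetical helper, and the host names and ports in example_env are made up for illustration:

```python
import json


def parse_task_env(env):
    """Extract cluster layout and task identity from CLUSTER_SPEC, JOB_NAME, TASK_INDEX."""
    spec = json.loads(env["CLUSTER_SPEC"])
    return {
        "ps_hosts": spec["ps"],
        "worker_hosts": spec["worker"],
        "job_name": env["JOB_NAME"],
        "task_index": int(env["TASK_INDEX"]),
    }


# Hypothetical values, shaped like what the excerpt above expects to find.
example_env = {
    "CLUSTER_SPEC": '{"ps": ["host-a:2222"], "worker": ["host-b:2222", "host-c:2222"]}',
    "JOB_NAME": "worker",
    "TASK_INDEX": "1",
}
task = parse_task_env(example_env)
```

Inside a real task, parse_task_env(os.environ) would be the equivalent call.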

 

Running the job:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/homework/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/homework/MyJob/tony-cli-0.3.23-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-11-08 10:46:51,237 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
2019-11-08 10:46:51,303 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, yarn-default.xml, yarn-site.xml, resource-types.xml, null/core-site.xml, null/hdfs-site.xml
2019-11-08 10:46:51,368 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-11-08 10:46:52,271 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:46:56,196 INFO cli.ClusterSubmitter: Copying /mnt/homework/MyJob/tony-cli-0.3.23-all.jar to: hdfs://localhost:9000/user/homework/.tony/5077ae03-8d28-4bc3-8195-d0daea3e3018
2019-11-08 10:46:56,231 INFO tony.TonyClient: TonY heartbeat interval [1000]
2019-11-08 10:46:56,231 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
2019-11-08 10:46:56,254 INFO tony.TonyClient: Starting client..
2019-11-08 10:46:56,258 INFO client.RMProxy: Connecting to ResourceManager at localhost/192.168.0.100:8032
2019-11-08 10:46:56,489 INFO conf.Configuration: resource-types.xml not found
2019-11-08 10:46:56,489 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-11-08 10:46:56,531 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:46:56,545 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:06,235 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,497 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,723 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,736 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,748 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:13,981 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-08 10:47:14,021 INFO tony.TonyClient: Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.ApplicationMaster 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
2019-11-08 10:47:14,023 INFO tony.TonyClient: Submitting YARN application
2019-11-08 10:47:14,071 INFO impl.YarnClientImpl: Submitted application application_1573174209638_0003
2019-11-08 10:47:14,072 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://localhost:8088/proxy/application_1573174209638_0003/
2019-11-08 10:47:14,072 INFO tony.TonyClient: ResourceManager web address for application: http://localhost:8088/cluster/app/application_1573174209638_0003
2019-11-08 10:47:24,111 INFO tony.TonyClient: Driver (application master) log url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000001/homework
2019-11-08 10:47:24,111 INFO tony.TonyClient: AM host: localhost
2019-11-08 10:47:24,111 INFO tony.TonyClient: AM RPC port: 14894
2019-11-08 10:47:27,291 INFO tony.TonyClient: Task status updated: [TaskInfo] name: ps, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000002/homework status: RUNNING
2019-11-08 10:47:27,291 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000003/homework status: RUNNING
2019-11-08 10:47:27,291 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000004/homework status: RUNNING
2019-11-08 10:47:27,293 INFO tony.TonyClient: Logs for ps 0 at: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000002/homework
2019-11-08 10:47:27,293 INFO tony.TonyClient: Logs for worker 0 at: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000003/homework
2019-11-08 10:47:27,293 INFO tony.TonyClient: Logs for worker 1 at: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000004/homework
2019-11-08 10:48:42,503 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000004/homework status: SUCCEEDED
2019-11-08 10:48:43,506 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000003/homework status: SUCCEEDED
2019-11-08 10:48:44,509 INFO tony.TonyClient: Task status updated: [TaskInfo] name: ps, index: 0, url: http://localhost:8042/node/containerlogs/container_1573174209638_0003_01_000002/homework status: FINISHED
2019-11-08 10:48:45,512 INFO tony.TonyClient: Application 3 finished with YarnState=FINISHED, DSFinalStatus=SUCCEEDED, breaking monitoring loop.
2019-11-08 10:48:45,512 INFO tony.TonyClient: Link for application_1573174209638_0003's events/metrics: https://localhost:19886/jobs/application_1573174209638_0003
2019-11-08 10:48:45,518 INFO tony.TonyClient: Sent message to AM to stop.
2019-11-08 10:48:45,518 INFO tony.TonyClient: Application completed successfully
2019-11-08 10:48:45,535 INFO impl.YarnClientImpl: Killed application application_1573174209638_0003
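
The "Task status updated" lines in the client log above are regular enough to be summarized by a script. The following sketch keeps the last reported status per task; the regular expression is derived only from the lines shown here, so other TonY versions may need a different pattern:

```python
import re

# Matches e.g. "... Task status updated: [TaskInfo] name: ps, index: 0, url: http://... status: RUNNING"
TASK_STATUS = re.compile(
    r"Task status updated: \[TaskInfo\] name: (\w+), index: (\d+), url: \S+ status: (\w+)"
)


def final_task_statuses(log_lines):
    """Return {(task_name, task_index): last_status_seen} from client log lines."""
    statuses = {}
    for line in log_lines:
        match = TASK_STATUS.search(line)
        if match:
            name, index, status = match.groups()
            statuses[(name, int(index))] = status  # later lines overwrite earlier ones
    return statuses
```

Feeding it the log above would report SUCCEEDED for both workers and FINISHED for the ps task.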

 

Output:

[homework@localhost]$ ll
total 76764
-rw-rw-r-- 1 homework homework 39295624 Nov 8 11:11 model.ckpt-0.data-00000-of-00001
-rw-rw-r-- 1 homework homework 994 Nov 8 11:11 model.ckpt-0.index
-rw-rw-r-- 1 homework homework 39295624 Nov 8 11:11 model.ckpt-1002.data-00000-of-00001
-rw-rw-r-- 1 homework homework 994 Nov 8 11:11 model.ckpt-1002.index
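
To pick the most recent checkpoint from a directory like the one listed above, the step number can be read from the file names. A minimal sketch, assuming the model.ckpt-<step>.index naming seen in the listing; latest_checkpoint_step is a hypothetical helper:

```python
import re

CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(filenames):
    """Return the highest checkpoint step found among the file names, or None."""
    steps = [int(m.group(1)) for m in map(CKPT_INDEX.match, filenames) if m]
    return max(steps) if steps else None
```

For the listing above, latest_checkpoint_step(os.listdir(".")) would return 1002.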

 

Job submission, job running, and job completed: (screenshots not preserved)

 

Killing a job:

yarn application -kill application_1573094688604_0003 (the application ID) kills the job.
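
When the kill is issued from a script, it helps to validate the application ID first so a typo is caught before calling yarn. A small sketch; build_kill_command is a hypothetical helper, and the ID pattern is inferred from the IDs that appear in this article:

```python
import re

# IDs in this article look like application_<cluster timestamp>_<sequence number>.
APP_ID = re.compile(r"^application_\d+_\d+$")


def build_kill_command(app_id):
    """Return the `yarn application -kill` argument list for a well-formed ID."""
    if not APP_ID.match(app_id):
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return ["yarn", "application", "-kill", app_id]
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where the yarn client is configured.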

 

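Failure case:

  • On line 190 of mnist_distributed.py, change CLUSTER_SPEC to CLUSTER_SPEC1;
  • Submit the job again; it fails and prints an error message (screenshot not preserved);

  • Check the logs for the detailed error message (screenshot not preserved).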
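
In plain Python, reading a misspelled variable name from os.environ raises KeyError, which is presumably the kind of error the task log reports in this case. As a hedged sketch, a task could fail with a more descriptive message instead; read_required_env is a hypothetical helper and not part of the original script:

```python
def read_required_env(env, var_name="CLUSTER_SPEC"):
    """Return env[var_name], or raise a descriptive error if it is missing."""
    try:
        return env[var_name]
    except KeyError:
        raise RuntimeError(
            f"environment variable {var_name!r} is not set; "
            "check the variable name used in the training script"
        ) from None
```

With this helper, read_required_env(os.environ, "CLUSTER_SPEC1") fails immediately and names the missing variable in the message.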

 

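Process description:

  1. INFO cli.ClusterSubmitter: Copying /mnt/homework/MyJob/tony-cli-0.3.23-all.jar to: hdfs://localhost:9000/user/homework/.tony/1bc58531-bc24-453c-869f-b3530f44277e: the TonY jar is uploaded to HDFS;
  2. INFO impl.YarnClientImpl: Submitted application application_1573183872429_0002: the configuration, code, and dependencies the project needs are copied to HDFS.

Notes:

    1. When a job terminates normally, its data on HDFS is cleaned up automatically; when a job is terminated abnormally, intermediate data is left behind on HDFS and has to be cleaned up periodically;
    2. If the Hadoop web console cannot be reached, add the machine's hostname to the C:\Windows\System32\drivers\etc\hosts file.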
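
For the periodic cleanup in note 1, a script could build the removal command for one leftover staging directory. This sketch assumes the leftovers live under /user/<user>/.tony/<uuid>, as the "Copying ... to:" log lines suggest; staging_cleanup_command is a hypothetical helper, and the UUID check is there so that a malformed argument cannot point the delete at an unintended path:

```python
import uuid


def staging_cleanup_command(user, staging_id):
    """Return the `hdfs dfs -rm -r` argument list for one .tony staging directory."""
    normalized = str(uuid.UUID(staging_id))  # raises ValueError if not a UUID
    path = f"/user/{user}/.tony/{normalized}"
    return ["hdfs", "dfs", "-rm", "-r", path]
```

It is worth confirming that the application has really finished before running the resulting command, since a running job still reads from its staging directory.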