hadoop distcp using subprocess.Popen

霸气de小男生 提交于 2020-02-29 06:04:32

问题


I am trying to run hadoop distcp command using subprocess.Popen in python and get error - Invalid input. The same command runs fine if I run as Hadoop command.

Hadoop command:

hadoop distcp -log /user/name/distcp_log -skipcrccheck -update hdfs://xxxxx:8020/sourceDir hdfs://xxxxx:8020/destDir

In python:

from subprocess import Popen, PIPE
proc1 = Popen(['hadoop','distcp','-log /user/name/distcp_log -skipcrccheck -update', 'hdfs://xxxxx:8020/sourceDir', 'hdfs://xxxxx:8020/destDir'], stdout=subprocess.PIPE)

log message:

INFO tools.OptionsParser: parseChunkSize: blocksperchunk false
INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[-log /user/name/distcp_log -skipcrccheck -update, hdfs://xxxxx:8020/sourceDir], targetPath=hdfs://xxxxx:8020/destDir, targetPathExists=true, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192}
ERROR tools.DistCp: Invalid input:
org.apache.hadoop.tools.CopyListing$InvalidInputException: -log /user/name/distcp_log -skipcrccheck -update doesn't exist

It considering the options as the source directory. How to tell the subprocess these are options and should not be considered as source(sourcePaths=[-log /user/name/distcp_log -skipcrccheck -update, hdfs://xxxxx:8020/sourceDir] )?

I am using Python2.7 and do not have access to pip install and its Kerberos cluster. Wanted to run this command for intra-cluster transfer but before that wanted to try this simple command within cluster.

Thanks


回答1:


Split all arguments of your command line into separate elements of Popen first argument list:

from subprocess import Popen, PIPE
proc1 = Popen(['hadoop','distcp','-log', '/user/name/distcp_log', '-skipcrccheck', '-update', 'hdfs://xxxxx:8020/sourceDir', 'hdfs://xxxxx:8020/destDir'], stdout=subprocess.PIPE)

Here you can find Popen documentation, saying that args should be a list created by splitting all arguments by ' '.



来源:https://stackoverflow.com/questions/51153722/hadoop-distcp-using-subprocess-popen

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!