distcp

No FileSystem for scheme: sftp

可紊 submitted on 2020-01-06 19:27:06
Question: I am trying to use SFTP in Hadoop with distcp like below:

    hadoop distcp -D fs.sftp.credfile=/home/bigsql/cred.prop sftp://<<ip address>>:22/export/home/nz/samplefile hdfs:///user/bigsql/distcp

But I am getting the below error:

    15/11/23 13:29:06 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[sftp://<<source ip>…
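The "No FileSystem for scheme: sftp" error usually means the Hadoop build in use has no FileSystem implementation registered for the sftp:// scheme. A hedged sketch of one way to map the scheme explicitly, assuming Hadoop 2.8+ (which ships org.apache.hadoop.fs.sftp.SFTPFileSystem); the credential file and paths are copied from the question:

    hadoop distcp \
      -D fs.sftp.impl=org.apache.hadoop.fs.sftp.SFTPFileSystem \
      -D fs.sftp.credfile=/home/bigsql/cred.prop \
      sftp://<<ip address>>:22/export/home/nz/samplefile \
      hdfs:///user/bigsql/distcp

On older Hadoop releases the SFTP FileSystem class is not bundled, so a jar providing it would also have to be placed on the distcp classpath.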

Using GroupBy while copying from HDFS to S3 to merge files within a folder

谁说我不能喝 submitted on 2020-01-05 08:48:09
Question: I have the following folders in HDFS:

    hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101
    hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101
    hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101
    hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101
    hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101
    hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101
    hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101
    hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101
    hdfs://x.x…
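The merge-on-copy behaviour the title asks about is what s3-dist-cp's --groupBy option provides: files whose paths match the regular expression are concatenated, one output file per distinct set of capturing-group values. A hedged sketch, where the destination bucket and the exact grouping pattern are assumptions rather than taken from the question:

    s3-dist-cp \
      --src hdfs://x.x.x.x:8020/Air/BOOK/ \
      --dest s3://my-bucket/Air/BOOK/ \
      --groupBy '.*/(AE|BH|IN|KW|ME)/(DOM|INT)/.*' \
      --targetSize 128

Here --targetSize (in MiB) caps the size of each merged output file, and all files sharing the same captured country/market values would be merged together.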

s3-dist-cp and hadoop distcp job infinitely looping in EMR

喜夏-厌秋 submitted on 2019-12-25 07:44:58
Question: I'm trying to copy 193 GB of data from S3 to HDFS. I'm running the following commands for s3-dist-cp and hadoop distcp:

    s3-dist-cp --src s3a://PathToFile/file1 --dest hdfs:///user/hadoop/S3CopiedFiles/
    hadoop distcp s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/

I'm running these on the master node and also keeping a check on the amount being transferred. It took about an hour, and after copying it over everything gets erased; disk space is shown as 99.8% on the 4 core instances in my…
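Independently of the console's disk-usage view, it can help to confirm what actually landed at the destination before and after the job finishes. A minimal sketch using standard HDFS shell commands against the destination path from the question:

    # total size of the copied data, human readable
    hdfs dfs -du -s -h hdfs:///user/hadoop/S3CopiedFiles/
    # directory, file and byte counts under the destination
    hdfs dfs -count hdfs:///user/hadoop/S3CopiedFiles/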

Hadoop: specify YARN queue for distcp

≡放荡痞女 submitted on 2019-12-12 10:59:29
Question: On our cluster we have set up dynamic resource pools. The rules are set so that YARN will first look at the specified queue, then at the username, then at the primary group... However, with distcp I can't seem to be able to specify a queue; it just sets it to the primary group. This is how I run it now (which doesn't work):

    hadoop distcp -Dmapred.job.queue.name:root.default .......

Answer 1: You are making a mistake in the specification of the parameter. You should not use ":" for separating the…
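Following the answer, the property name and its value have to be separated with "=" rather than ":". A hedged sketch of the corrected invocation (the source and destination paths are placeholders; on current Hadoop the non-deprecated property name is mapreduce.job.queuename):

    hadoop distcp -Dmapred.job.queue.name=root.default hdfs:///src/path hdfs:///dest/path
    # or, using the newer property name
    hadoop distcp -Dmapreduce.job.queuename=root.default hdfs:///src/path hdfs:///dest/path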

Multiple source files for s3distcp

情到浓时终转凉″ submitted on 2019-12-12 04:43:39
Question: Is there a way to copy a list of files from S3 to HDFS, instead of a complete folder, using s3distcp? This is for when srcPattern cannot work. I have multiple files in an S3 folder, all with different names, and I want to copy only specific files to an HDFS directory. I did not find any way to specify multiple source file paths to s3distcp. The workaround I am currently using is to list all the file names in srcPattern:

    hadoop jar s3distcp.jar --src s3n://bucket/src_folder/ --dest hdfs:///test/output/ -…
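One way to express that workaround is a srcPattern regex that alternates over the wanted file names. A minimal sketch (the file names are placeholders, not from the question; the leading and trailing .* keep the pattern matching the full path):

    hadoop jar s3distcp.jar \
      --src s3n://bucket/src_folder/ \
      --dest hdfs:///test/output/ \
      --srcPattern '.*(file_a|file_b|file_c).*'

Newer s3-dist-cp builds on EMR also document a --srcPrefixesFile option that takes a file listing source prefixes, which may be a closer fit when the list is long, though availability depends on the EMR release.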

Distcp Mismatch in length of source

假如想象 submitted on 2019-12-11 02:24:35
Question: I am facing an issue while executing a distcp command between two different Hadoop clusters:

    Caused by: java.io.IOException: Mismatch in length of source:hdfs://ip1/xxxxxxxxxx/xxxxx and target:hdfs://nameservice1/xxxxxx/.distcp.tmp.attempt_1483200922993_0056_m_000011_2

I tried using -pb and -skipcrccheck:

    hadoop distcp -pb -skipcrccheck -update hdfs://ip1/xxxxxxxxxx/xxxxx hdfs:///xxxxxxxxxxxx/
    hadoop distcp -pb hdfs://ip1/xxxxxxxxxx/xxxxx hdfs:///xxxxxxxxxxxx/
    hadoop distcp -skipcrccheck -update…
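Before re-running the copy it can help to compare the two sides of the reported mismatch directly. A minimal sketch using standard HDFS shell commands, reusing the masked paths from the question:

    # reported length of the source file on the remote cluster
    hdfs dfs -ls hdfs://ip1/xxxxxxxxxx/xxxxx
    # its checksum; checksums of two copies are only comparable when the
    # block sizes match, which is why -pb (preserve block size) matters
    hdfs dfs -checksum hdfs://ip1/xxxxxxxxxx/xxxxx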

Distcp - Container is running beyond physical memory limits

别来无恙 submitted on 2019-12-02 02:34:41
Question: I've been struggling with distcp for several days and I swear I have googled enough. Here is my use case:

USE CASE: I have a main folder in a certain location, say /hdfs/root, with a lot of subdirectories (the depth is not fixed) and files. Volume: 200,000 files, roughly 30 GB. I need to copy only a subset of /hdfs/root for a client into another location, say /hdfs/dest. This subset is defined by a list of absolute paths that can be updated over time. Volume: 50,000 files, roughly 5 GB. You understand that I can't use a simple hdfs dfs -cp /hdfs/root /hdfs/dest because it is not optimized: it would take every file, and it…
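For a fixed list of absolute paths, distcp can be driven by a file list via -f instead of walking the whole tree, and the "running beyond physical memory limits" error itself comes from YARN killing containers that exceed their memory allotment, which can be addressed by giving the map tasks more memory. A hedged sketch in which the list-file location and the memory values are assumptions:

    # /tmp/file_list.txt on HDFS contains one absolute source URI per line
    hadoop distcp \
      -Dmapreduce.map.memory.mb=4096 \
      -Dmapreduce.map.java.opts=-Xmx3276m \
      -f hdfs:///tmp/file_list.txt \
      hdfs:///hdfs/dest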

Copying files from an HDFS directory to another with an Oozie distcp-action

ぃ、小莉子 submitted on 2019-11-27 07:31:49
问题 My actions start_fair_usage ends with status okey, but test_copy returns Main class [org.apache.oozie.action.hadoop.DistcpMain], main() threw exception, null In /user/comverse/data/${1}_B I have a lot of different files, some of which I want to copy to ${NAME_NODE}/user/evkuzmin/output . For that I try to pass paths from copy_files.sh which holds an array of paths to the files I need. <action name="start_fair_usage"> <shell xmlns="uri:oozie:shell-action:0.1"> <job-tracker>${JOB_TRACKER}</job