s3distcp

Using GroupBy while copying from HDFS to S3 to merge files within a folder

Submitted by 谁说我不能喝 on 2020-01-05 08:48:09
Question: I have the following folders in HDFS:

hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101
hdfs://x.x
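A common way to do this merge during the copy is s3-dist-cp's --groupBy option, which concatenates every source file whose path matches the same regex capture group into one output file. A minimal sketch, run from an EMR node, assuming one merged file per <country>/<market>/<date>/<hour> folder is wanted; the destination bucket name and the regex are illustrative placeholders:

    # Files sharing the same captured group value are concatenated into a
    # single S3 object; --targetSize (in MiB) caps the size of each output.
    s3-dist-cp \
      --src hdfs:///Air/BOOK/ \
      --dest s3://my-bucket/Air/BOOK/ \
      --groupBy '.*/Air/BOOK/(\w+/\w+/\d+/\d+)/.*' \
      --targetSize 1024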

s3-dist-cp and hadoop distcp jobs infinitely looping in EMR

Submitted by 喜夏-厌秋 on 2019-12-25 07:44:58
Question: I'm trying to copy 193 GB of data from S3 to HDFS. I'm running the following commands for s3-dist-cp and hadoop distcp:

s3-dist-cp --src s3a://PathToFile/file1 --dest hdfs:///user/hadoop/S3CopiedFiles/
hadoop distcp s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/

I'm running these on the master node and also keeping a check on the amount being transferred. It took about an hour, and after copying it over, everything gets erased; disk space is shown as 99.8% on the 4 core instances in my
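When a copy of this size misbehaves, it can help to make the job's parallelism and resume behavior explicit instead of relying on defaults, and to verify afterwards where the space actually went. A sketch with illustrative values (the mapper count of 48 is an assumption, not from the post):

    # -m sets the number of map tasks; -update skips files that already exist
    # at the destination with the same size, so a re-run resumes the copy.
    hadoop distcp -m 48 -update s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/

    # Check what actually landed in HDFS and the per-datanode disk usage:
    hdfs dfs -du -s -h /user/hadoop/S3CopiedFiles/
    hdfs dfsadmin -report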

Deduce the HDFS path at runtime on EMR

Submitted by 自闭症网瘾萝莉.ら on 2019-12-12 04:36:20
Question: I have spawned an EMR cluster with an EMR step to copy a file from S3 to HDFS and vice versa using s3-dist-cp. This cluster is an on-demand cluster, so we are not keeping track of the IP.

The first EMR step is:

hadoop fs -mkdir /input

This step completed successfully.

The second EMR step uses the following command:

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/sample.txt --dest=hdfs:///input

This step FAILED. I get the following exception: Error
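Because the cluster is on-demand, the steps can target the cluster ID rather than any IP, and a host-less hdfs:/// URI resolves against the cluster's fs.defaultFS at runtime, so the NameNode address never needs to be known in advance. A sketch of submitting both steps via command-runner.jar; the cluster ID is a placeholder:

    aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
      'Type=CUSTOM_JAR,Name=MakeInputDir,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[hadoop,fs,-mkdir,/input]' \
      'Type=CUSTOM_JAR,Name=CopyFromS3,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,--src=s3://<bucket-name>/<folder-name>/sample.txt,--dest=hdfs:///input]'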

Adding S3DistCp to PySpark

Submitted by 自闭症网瘾萝莉.ら on 2019-12-12 01:29:50
Question: I'm trying to add S3DistCp to my local, standalone Spark install. I downloaded S3DistCp:

aws s3 cp s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar .

and the AWS SDK as well:

wget http://sdk-for-java.amazonwebservices.com/latest/aws-java-sdk.zip

I extracted the AWS SDK:

unzip aws-java-sdk.zip

Then I added s3distcp.jar to my spark-defaults.conf:

spark.driver.extraClassPath /Users/mark.miller/.ivy2/jars/s3distcp.jar
spark.executor.extraClassPath /Users/mark.miller/.ivy2/jars/s3distcp
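For reference, both the driver and executor classpaths usually need the AWS SDK jar alongside s3distcp.jar. A sketch of what the finished spark-defaults.conf entries might look like; the aws-java-sdk jar name and version below are hypothetical and depend on what the zip actually extracts:

    # Both driver and executors need s3distcp and the AWS SDK on the
    # classpath; multiple entries are joined with ':' on macOS/Linux.
    spark.driver.extraClassPath   /Users/mark.miller/.ivy2/jars/s3distcp.jar:/Users/mark.miller/.ivy2/jars/aws-java-sdk-1.11.123.jar
    spark.executor.extraClassPath /Users/mark.miller/.ivy2/jars/s3distcp.jar:/Users/mark.miller/.ivy2/jars/aws-java-sdk-1.11.123.jar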