I have set up a Hadoop cluster of 5 nodes on Amazon EC2. Now, when I log in to the master node and submit the following command
bin/hadoop jar
Try using Amazon Elastic MapReduce. It removes the need to configure the Hadoop nodes yourself, and you can access objects in your S3 account in the way you expect.
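For instance, with the current AWS CLI a small Hadoop cluster can be brought up in one command. This is only a sketch: the cluster name, release label, instance type, and key pair name below are illustrative, not anything prescribed above.
aws emr create-cluster --name "my-hadoop-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge --instance-count 5 \
  --use-default-roles --ec2-attributes KeyName=my-key-pair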
Use
-Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key>
e.g.
hadoop distcp -Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key> <src> <dst>
or
hadoop fs -Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key> -<subcommand> <args>
You probably want to use s3n:// URLs, not s3:// URLs. s3n:// means "a regular file, readable from the outside world, at this S3 URL". s3:// refers to an HDFS file system mapped into an S3 bucket.
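Concretely, listing a bucket with the keys passed inline would look like this (the bucket name is hypothetical):
hadoop fs -Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key> -ls s3n://myhappybucket/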
To avoid the URL escaping issue for the access key (and to make life much easier), put them into the /etc/hadoop/conf/core-site.xml file:
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>0123458712355</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>hi/momasgasfglskfghaslkfjg</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>0123458712355</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>hi/momasgasfglskfghaslkfjg</value>
</property>
There was at one point an outstanding issue with secret keys that contained a slash: the URL was decoded in some contexts but not in others. I don't know whether it has been fixed, but I do know that with the keys in the config file the problem goes away.
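The escaping issue arises because Hadoop also accepts the credentials embedded in the URL itself; in that form a slash in the secret key must be percent-encoded as %2F. An illustrative sketch of that form (bucket name hypothetical):
hadoop fs -ls s3n://<your-key>:<your-secret-key>@myhappybucket/
Keeping the keys in core-site.xml sidesteps this entirely.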
Other quickies:
You can list the bucket: hadoop fs -ls s3n://myhappybucket/
or copy a file down: hadoop fs -cp s3n://myhappybucket/happyfile.txt /tmp/dest1
and even copy a file up: hadoop fs -cp /tmp/some_hdfs_file s3n://myhappybucket/will_be_put_into_s3
The distcp command runs a mapper-only job to copy a tree from there to here. Use it if you want to copy a very large number of files into HDFS. (For everyday use, hadoop fs -cp src dest works just fine.)
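For instance, a minimal sketch of pulling a whole S3 tree into HDFS (bucket and paths are hypothetical):
hadoop distcp s3n://myhappybucket/logs /data/logs_from_s3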
If you are seeing SocketTimeoutExceptions, apply the patch for HADOOP-6254. We were, and we did, and they went away.
You can also use Apache Whirr for this workflow. Check the Quick Start Guide and the 5 minutes guide for more info.
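For orientation, launching a Hadoop cluster with Whirr means writing a properties recipe and passing it to the CLI. This is a sketch along the lines of the Quick Start Guide; the cluster name and instance layout below are illustrative, and the guide is the authority on exact template names.
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
Save that as hadoop.properties, then:
whirr launch-cluster --config hadoop.properties
and tear it down afterwards with:
whirr destroy-cluster --config hadoop.properties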
Disclaimer: I'm one of the committers.