elastic-map-reduce

Getting data in and out of Elastic MapReduce HDFS

╄→尐↘猪︶ㄣ submitted on 2019-12-05 03:38:50
Question: I've written a Hadoop program that requires a certain layout within HDFS, after which I need to get the files back out of HDFS. It works on my single-node Hadoop setup and I'm eager to get it working on tens of nodes within Elastic MapReduce. What I've been doing is something like this:

    ./elastic-mapreduce --create --alive
    JOBID="j-XXX" # output from creation
    ./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp s3://bucket-id/XXX /XXX"
    ./elastic-mapreduce -j $JOBID --jar s3://bucket-id
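For a batch-style flow, the same copy-in / copy-out can be expressed as job-flow steps rather than an interactive --ssh session. A minimal sketch, using the stock s3distcp jar and placeholder bucket/jar names (not the poster's):

    # create a long-running job flow and note the id it prints (j-XXXXXXXXXXXX)
    ./elastic-mapreduce --create --alive
    JOBID="j-XXXXXXXXXXXX"

    # step 1: pull the input from S3 into HDFS
    ./elastic-mapreduce -j $JOBID \
        --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
        --args '--src,s3://my-bucket/input/,--dest,hdfs:///input/'

    # step 2: run the job itself (placeholder jar and arguments)
    ./elastic-mapreduce -j $JOBID \
        --jar s3://my-bucket/my-job.jar \
        --args 'hdfs:///input/,hdfs:///output/'

    # step 3: push the results from HDFS back out to S3
    ./elastic-mapreduce -j $JOBID \
        --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
        --args '--src,hdfs:///output/,--dest,s3://my-bucket/output/'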

How to run a MapReduce job on Amazon's Elastic MapReduce (EMR) cluster from Windows?

倾然丶 夕夏残阳落幕 submitted on 2019-12-04 18:15:15
I'm trying to learn how to run a Java Map/Reduce (M/R) job on Amazon's EMR. The documentation I am following is here: http://aws.amazon.com/articles/3938 . I am on a Windows 7 computer. When I try to run this command, I am shown the help information:

    ./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json

Of course, since I am on a Windows machine, I actually type in this command. I am not sure why, but for this particular command there was no Windows version (all commands were shown in pairs, one for *nix and one for Windows):

    ruby elastic-mapreduce RunJobFlow my_job.json

my
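For comparison, the elastic-mapreduce Ruby CLI is normally driven with flags such as --create and --jar rather than a RunJobFlow subcommand. A sketch of launching a custom-jar job flow from a Windows command prompt (bucket, jar, and class names are placeholders, not from the question):

    ruby elastic-mapreduce --create --name "my-mr-job" ^
        --jar s3://my-bucket/my-job.jar ^
        --main-class com.example.MyJob ^
        --args s3://my-bucket/input/,s3://my-bucket/output/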

Copy files from Amazon S3 to HDFS using s3distcp fails

被刻印的时光 ゝ submitted on 2019-12-04 07:49:17
I am trying to copy files from S3 to HDFS using a workflow in EMR, and when I run the command below the jobflow starts successfully but gives me an error when it tries to copy the files to HDFS. Do I need to set any input file permissions?

Command:

    ./elastic-mapreduce --jobflow j-35D6JOYEDCELA --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://odsh/input/,--dest,hdfs:///Users

Output:

    Task TASKID="task_201301310606_0001_r_000000" TASK_TYPE="REDUCE" TASK_STATUS="FAILED" FINISH_TIME="1359612576612" ERROR="java.lang.RuntimeException: Reducer task failed to
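For reference, a well-formed s3distcp step with both --src and --dest fully specified looks something like the following (placeholder job flow id, bucket, and destination; the poster's actual --dest value is cut off above):

    ./elastic-mapreduce --jobflow j-XXXXXXXXXXXX \
        --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
        --args '--src,s3://my-bucket/input/,--dest,hdfs:///input/'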

In Hadoop, where can I change the default URL ports 50070 and 50030 for the NameNode and JobTracker web pages?

六月ゝ 毕业季﹏ submitted on 2019-12-04 07:30:52
There must be a way to change the ports 50070 and 50030 so that the following URLs display the cluster statuses on the ports I pick:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Define your choice of ports by setting the properties dfs.http.address for the NameNode and mapred.job.tracker.http.address for the JobTracker in conf/core-site.xml (both properties take a host:port value):

    <configuration>
      <property>
        <name>dfs.http.address</name>
        <value>0.0.0.0:50070</value>
      </property>
      <property>
        <name>mapred.job.tracker.http.address</name>
        <value>0.0.0.0:50030</value>
      </property>
    </configuration>

This question is old but probably worth

Getting “No space left on device” for approx. 10 GB of data on EMR m1.large instances

一曲冷凌霜 submitted on 2019-12-04 06:47:30
I am getting a "No space left on device" error when I run my Amazon EMR jobs using m1.large as the instance type for the Hadoop instances created by the jobflow. The job generates approx. 10 GB of data at most, and since the capacity of an m1.large instance is supposed to be 420 GB * 2 (according to the EC2 instance types page), I am confused how just 10 GB of data could lead to a "disk space full" kind of message. I am aware that this kind of error can also be generated if we have completely exhausted the total number of inodes allowed on the filesystem, but that is
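One way to narrow this down on the cluster nodes is to check block usage and inode usage separately, and to confirm which mount Hadoop is actually writing intermediate data to. A sketch (the conf path is an assumption for older EMR AMIs, not from the question):

    # block usage per mount: is a small root volume full while the big /mnt volume is empty?
    df -h
    # inode usage per mount: rules the inode-exhaustion theory in or out
    df -i
    # which directories is Hadoop using for intermediate map output?
    grep -A 1 "mapred.local.dir" /home/hadoop/conf/mapred-site.xml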

Error: java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class Myclass

 ̄綄美尐妖づ submitted on 2019-12-03 21:34:22
I have my mapper and reducer as follows, but I am getting a strange exception and I can't figure out why it is being thrown.

    public static class MyMapper implements Mapper<LongWritable, Text, Text, Info> {
        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, Info> output, Reporter reporter) throws IOException {
            Text text = new Text("someText");
            // process
            output.collect(text, infoObjeject);
        }
    }

    public static class MyReducer implements Reducer<Text, Info, Text, Text> {
        @Override
        public void reduce(Text key, Iterator<Info> values,
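The question doesn't show the driver, but with the old mapred API this error is typically about the declared output value classes not matching what the mapper, combiner, and reducer actually emit. A hedged sketch of a driver consistent with the classes above (MyJobDriver and the job name are made up for illustration; MyMapper, MyReducer, and Info are the poster's):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyJobDriver.class);
            conf.setJobName("my-job");

            conf.setMapperClass(MyMapper.class);
            conf.setReducerClass(MyReducer.class);
            // Don't reuse MyReducer as a combiner here: a combiner must emit the map
            // output types (Text, Info), but MyReducer emits (Text, Text), which is
            // one common way to end up with a "wrong value class" error.

            // These must match what MyMapper actually emits ...
            conf.setMapOutputKeyClass(Text.class);
            conf.setMapOutputValueClass(Info.class);
            // ... and these must match what MyReducer actually emits.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }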

Scheduling A Job on AWS EC2

放肆的年华 submitted on 2019-12-03 18:42:23
Question: I have a website running on AWS EC2. I need to create a nightly job that generates a sitemap file and uploads the files to the various browsers. I'm looking for a utility on AWS that allows this functionality. I've considered the following:

1) Generate a request to the web server that triggers it to do this task. I don't like this approach because it ties up a server thread and uses CPU cycles on the host.

2) Create a cron job on the machine the web server is running on to execute this task
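If option 2 wins out, the scheduling part is a single crontab entry on the EC2 host; a sketch with made-up script and log paths (the sitemap-generation script itself is not shown in the question):

    # m  h  dom mon dow  command
    30 2 * * * /home/ec2-user/bin/generate_sitemap.sh >> /var/log/sitemap.log 2>&1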

Amazon Elastic MapReduce Bootstrap Actions not working

拟墨画扇 submitted on 2019-12-03 17:31:49
I have tried the following combinations of bootstrap actions to increase the heap size of my job, but none of them seem to work:

    --mapred-key-value mapred.child.java.opts=-Xmx1024m
    --mapred-key-value mapred.child.ulimit=unlimited
    --mapred-key-value mapred.map.child.java.opts=-Xmx1024m
    --mapred-key-value mapred.map.child.ulimit=unlimited
    -m mapred.map.child.java.opts=-Xmx1024m
    -m mapred.map.child.ulimit=unlimited
    -m mapred.child.java.opts=-Xmx1024m
    -m mapred.child.ulimit=unlimited

What is the right syntax?

You have two options to achieve this: Custom JVM Settings. In order to apply custom
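For reference, these mapred-site overrides are usually passed to the stock configure-hadoop bootstrap action rather than as bare flags on the create command; a sketch (treat the exact heap value and settings as placeholders, not a known fix for this particular job):

    ./elastic-mapreduce --create --alive \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "-m,mapred.child.java.opts=-Xmx1024m,-m,mapred.child.ulimit=unlimited"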

Hive: converting a comma separated string to array for table generating function

只愿长相守 submitted on 2019-12-03 14:04:56
Question: I am creating a Hive table on Amazon's Elastic MapReduce using a gzipped, JSON-encoded file. I am using this JSON SerDe: http://code.google.com/p/hive-json-serde/ The unencoded file looks like this:

    {"id":"101", "items":"A:231,234,119,12"}
    {"id":"102", "items":"B:13,89,121"}
    ...

I'd like to create an array from the "items" column for use with a table generating function. The array I want would be the "exploded" CSV of ints, ignoring the ":" and the letter before it. I want to be able to GROUP
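One way to get that exploded array with Hive's built-in string functions, sketched against a hypothetical table name (the poster's DDL isn't shown):

    -- strip the "A:" / "B:" prefix, split the rest on commas, and explode into rows
    SELECT id, CAST(item AS INT) AS item_value
    FROM my_table
    LATERAL VIEW explode(split(regexp_replace(items, '^[^:]*:', ''), ',')) exploded AS item;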

The reduce fails due to "Task attempt failed to report status for 600 seconds. Killing!" Solution?

让人想犯罪 __ submitted on 2019-12-03 13:25:16
Question: The reduce phase of the job fails with:

    # of failed Reduce Tasks exceeded allowed limit.

The reason each task fails is:

    Task attempt_201301251556_1637_r_000005_0 failed to report status for 600 seconds. Killing!

Problem in detail: The Map phase takes in each record, which is of the format: time, rid, data. The data is of the format: a data element and its count, e.g. a,1 b,4 c,7 corresponds to the data of one record. The mapper outputs, for each data element, the data for every record, e.g. key:
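The usual workaround for this timeout is either to raise mapred.task.timeout (e.g. -m mapred.task.timeout=1800000 via a configure-hadoop bootstrap action, as in the bootstrap-action question above) or to have the reducer report progress while it works. A sketch of the latter with the old mapred API (class name and value types are illustrative, not the poster's):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SlowReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            long processed = 0;
            while (values.hasNext()) {
                Text value = values.next();
                // ... expensive per-value work on `value`, output.collect(...) as needed ...
                if (++processed % 1000 == 0) {
                    reporter.progress();  // heartbeat so the 600-second watchdog doesn't kill the attempt
                    reporter.setStatus("processed " + processed + " values for key " + key);
                }
            }
        }
    }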