elastic-map-reduce

Running MapReduce jobs on AWS-EMR from Eclipse

Submitted by a 夏天 on 2019-12-07 13:30:24
Question: I have the WordCount MapReduce example in Eclipse. I exported it to a JAR, copied it to S3, and then ran it on AWS-EMR successfully. Then I read this article - http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-common-programming-sample.html - which shows how to use the AWS-EMR API to run MapReduce jobs, but it still assumes your MapReduce code is packaged in a JAR. I would like to know if there is a way to run MapReduce code from Eclipse directly on AWS-EMR, without having to export it to a JAR.

Life of distributed cache in Hadoop

Submitted by 心不动则不痛 on 2019-12-07 04:42:25
Question: When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after a job is completed? If they are deleted, which I presume they are, is there a way to make the cache remain for multiple jobs? Does this work the same way on Amazon's Elastic MapReduce? Answer 1: I was digging around in the source code, and it looks like files are deleted by TrackerDistributedCacheManager about once a minute when their reference count drops to zero.
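For reference, this is roughly how a file enters the distributed cache from a Java driver (the streaming equivalent is the -files or -cacheFile option). A minimal sketch using the classic Hadoop 1.x-era API; the bucket, file names, and job wiring are placeholders, not taken from the question:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the file; the TaskTracker localizes it once and tracks a
        // reference count for each job that uses it.
        DistributedCache.addCacheFile(new URI("s3n://my-bucket/lookup/terms.txt#terms.txt"), conf);
        Job job = new Job(conf, "cache-demo");
        // ... set mapper/reducer, input and output paths as usual ...
    }
}

Inside a task, DistributedCache.getLocalCacheFiles(conf) returns the local path of the copy; once no running job references the file any more, the cleanup pass described in the answer is free to remove it.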

How to know the job flow id and other cluster parameters in a script run via script-runner.jar

Submitted by 大憨熊 on 2019-12-06 15:48:15
I'm starting an elastic mapreduce cluster with the following command line:

$ elastic-mapreduce \
  --create \
  --num-instances "${INSTANCES}" \
  --instance-type m1.medium \
  --ami-version 3.0.4 \
  --name "${CLUSTER_NAME}" \
  --log-uri "s3://my-bucket/elasticmapreduce/logs" \
  --step-name "${STEP_NAME}" \
  --step-action TERMINATE_JOB_FLOW \
  --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
  --arg s3://my-bucket/log-parser/code/hadoop-script.sh \
  --arg "${CLUSTER_NAME}" \
  --arg "${STEP_NAME}" \
  --arg s3n://my-bucket/log-parser/input \
  --arg s3n://my-bucket/log-parser/output

I would like to know the job flow id and other cluster parameters from inside hadoop-script.sh.
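Values that are known before the cluster exists (like ${CLUSTER_NAME} and ${STEP_NAME} above) can simply be passed as --arg, but the job flow id is only assigned at launch. One commonly suggested approach is to read the metadata file that EMR writes on each node; the path /mnt/var/lib/info/job-flow.json and the jobFlowId field name are assumptions to verify on your AMI version, not something taken from the question. A minimal Java sketch of the idea (the shell script itself could do the equivalent with grep):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobFlowInfo {
    public static void main(String[] args) throws Exception {
        // EMR writes cluster metadata to this file on every node (assumed path).
        String json = new String(Files.readAllBytes(Paths.get("/mnt/var/lib/info/job-flow.json")), "UTF-8");
        // Pull out the jobFlowId field without needing a JSON library.
        Matcher m = Pattern.compile("\"jobFlowId\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        if (m.find()) {
            System.out.println("Job flow id: " + m.group(1));
        }
    }
}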

How to run a MapReduce job on Amazon's Elastic MapReduce (EMR) cluster from Windows?

Submitted by 谁都会走 on 2019-12-06 13:53:47
Question: I'm trying to learn how to run a Java Map/Reduce (M/R) job on Amazon's EMR. The documentation that I am following is here: http://aws.amazon.com/articles/3938. I am on a Windows 7 computer. When I try to run this command, I am shown the help information: ./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json Of course, since I am on a Windows machine, I actually type in this command. I am not sure why, but for this particular command there was not a Windows version (all commands where

Elastic Search Nested Query with Nested Object

Submitted by 青春壹個敷衍的年華 on 2019-12-06 09:11:58
Question: This is the type of data I have stored in my index in Elasticsearch. I have to find Recipes with Main Ingredient Beef (and weight less than 1000) with Ingredients (chilli powder and weight less than 250), (olive oil and weight less than 300), and similarly for all other ingredients.

"Name": "Real beef burritos",
"Ingredients": [
  {"name": "olive oil", "id": 27, "weight": 200},
  {"name": "bonion", "id": 3, "weight": 300},
  {"name": "garlic", "id": 2, "weight": 100},
  {"name": "chilli powder", "id": 35
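Each ingredient constraint pairs a name with a weight limit on the same nested object, so it has to be expressed as its own nested clause rather than as two independent filters; the main-ingredient condition would be added the same way. A rough sketch with a pre-5.x Elasticsearch Java client (field names, index mapping, and the exact client version are assumptions; newer clients require a ScoreMode argument on nestedQuery):

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class RecipeQuery {
    public static BoolQueryBuilder build() {
        // One nested clause per ingredient, matching name and weight together.
        BoolQueryBuilder chilli = QueryBuilders.boolQuery()
                .must(QueryBuilders.matchQuery("Ingredients.name", "chilli powder"))
                .must(QueryBuilders.rangeQuery("Ingredients.weight").lt(250));
        BoolQueryBuilder oliveOil = QueryBuilders.boolQuery()
                .must(QueryBuilders.matchQuery("Ingredients.name", "olive oil"))
                .must(QueryBuilders.rangeQuery("Ingredients.weight").lt(300));
        // Both ingredient conditions must hold for a recipe to match.
        return QueryBuilders.boolQuery()
                .must(QueryBuilders.nestedQuery("Ingredients", chilli))
                .must(QueryBuilders.nestedQuery("Ingredients", oliveOil));
    }
}

This only behaves as intended if Ingredients is mapped as a nested type; with a plain object mapping, the name and weight of different ingredients would be matched independently.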

DynamoDB InputFormat for Hadoop

Submitted by 不问归期 on 2019-12-06 07:48:31
Question: I have to process some data that is persisted in Amazon DynamoDB using Hadoop MapReduce. I was searching the internet for a Hadoop InputFormat for DynamoDB and couldn't find one. I'm not familiar with DynamoDB, so I'm guessing there is some trick related to DynamoDB and Hadoop? If there is an implementation of this InputFormat anywhere, could you please share it? Answer 1: After a lot of searching I found DynamoDBInputFormat and DynamoDBOutputFormat in one of Amazon's libraries. On Amazon Elastic
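For a sense of how those classes get wired in, here is a sketch against the old mapred API. The package name org.apache.hadoop.dynamodb.read and the dynamodb.* configuration keys are taken from what later became the open-source emr-dynamodb-connector; they may differ in the Amazon library the answer refers to, so treat all of them as assumptions to verify:

import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class DynamoScanJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DynamoScanJob.class);
        conf.setJobName("dynamodb-scan");
        // Table to read; property key assumed from the emr-dynamodb-connector docs.
        conf.set("dynamodb.input.tableName", "my-table");
        conf.set("dynamodb.regionid", "us-east-1");      // assumed key
        // Records arrive as (Text key, DynamoDBItemWritable item) pairs.
        conf.setInputFormat(DynamoDBInputFormat.class);
        // ... set mapper, output format, and output path as usual ...
        JobClient.runJob(conf);
    }
}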

Copying files from Amazon S3 to HDFS using s3distcp fails

Submitted by 独自空忆成欢 on 2019-12-06 04:00:01
Question: I am trying to copy files from S3 to HDFS using a workflow in EMR, and when I run the command below the jobflow starts successfully but gives me an error when it tries to copy the file to HDFS. Do I need to set any input file permissions?

Command:
./elastic-mapreduce --jobflow j-35D6JOYEDCELA --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://odsh/input/,--dest,hdfs:///Users

Output:
Task TASKID="task_201301310606_0001_r_000000" TASK_TYPE="REDUCE" TASK
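The same S3DistCp step can also be submitted through the AWS SDK for Java instead of the Ruby CLI, which makes it easier to script retries once the underlying permission or disk-space problem is sorted out. A minimal sketch; the job flow id, source, and destination are copied from the (truncated) command above, and the credentials handling is a placeholder:

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class AddS3DistCpStep {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr =
                new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        // S3DistCp is just another Hadoop jar step on the existing job flow.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar")
                .withArgs("--src", "s3://odsh/input/", "--dest", "hdfs:///Users");
        StepConfig step = new StepConfig()
                .withName("s3distcp copy")
                .withActionOnFailure("CONTINUE")
                .withHadoopJarStep(jarStep);
        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-35D6JOYEDCELA")
                .withSteps(step));
    }
}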

Trouble using HBase from Java on Amazon EMR

Submitted by 眉间皱痕 on 2019-12-06 03:49:57
Question: So I'm trying to query my HBase cluster on Amazon EC2 using a custom jar I launch as a MapReduce step. In my jar (inside the map function) I call HBase like so:

public void map(Text key, BytesWritable value, Context contex) throws IOException, InterruptedException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "tablename");
    ...

The problem is that when it gets to that HTable line and tries to connect to HBase, the step fails and I get the following errors:
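A common cause of this is that HBaseConfiguration.create() inside the task only picks up default settings, so the client tries to reach ZooKeeper on localhost instead of the node running HBase. A hedged sketch of the usual fix: point the configuration at the cluster's ZooKeeper quorum and open the table once in setup() rather than on every map() call. The hostname below is a placeholder, and shipping the cluster's hbase-site.xml with the job is an alternative to setting the properties by hand:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HBaseLookupMapper extends Mapper<Text, BytesWritable, Text, Text> {
    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        // Private DNS of the node running ZooKeeper/HBase master (placeholder value).
        conf.set("hbase.zookeeper.quorum", "ip-10-0-0-1.ec2.internal");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        table = new HTable(conf, "tablename");   // open once per task, not per record
    }

    @Override
    protected void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // ... use table.get(...) / table.put(...) here ...
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (table != null) {
            table.close();
        }
    }
}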

Hive — split data across files

Submitted by 夙愿已清 on 2019-12-06 01:33:53
Question: Is there a way to instruct Hive to split data into multiple output files? Or maybe to cap the size of the output files? I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading: http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, ten 1 GB files, which might make copying to Redshift faster. I was looking at https://cwiki.apache.org/Hive

Running MapReduce jobs on AWS-EMR from Eclipse

Submitted by 瘦欲@ on 2019-12-05 22:17:22
I have the WordCount MapReduce example in Eclipse. I exported it to a JAR, copied it to S3, and then ran it on AWS-EMR successfully. Then I read this article - http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-common-programming-sample.html - which shows how to use the AWS-EMR API to run MapReduce jobs, but it still assumes your MapReduce code is packaged in a JAR. I would like to know if there is a way to run MapReduce code from Eclipse directly on AWS-EMR, without having to export it to a JAR. I haven't found a way to do this (for MapReduce jobs written in Java). I believe there is
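As far as the EMR API goes, the jar itself is still required; what Eclipse can do is drive the upload and submission so that no manual console steps remain. A minimal sketch with the AWS SDK for Java, assuming the WordCount jar has already been copied to S3 (bucket names, key names, main class, and the AMI version are placeholders):

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class RunWordCountOnEmr {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr =
                new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // The step points at the jar exported from Eclipse and copied to S3.
        StepConfig wordCount = new StepConfig()
                .withName("WordCount")
                .withActionOnFailure("TERMINATE_JOB_FLOW")
                .withHadoopJarStep(new HadoopJarStepConfig()
                        .withJar("s3://my-bucket/jars/wordcount.jar")
                        .withMainClass("WordCount")
                        .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("WordCount from Eclipse")
                .withLogUri("s3://my-bucket/emr-logs/")
                .withAmiVersion("3.0.4")
                .withSteps(wordCount)
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m1.medium")
                        .withSlaveInstanceType("m1.medium")
                        .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow " + result.getJobFlowId());
    }
}

This does not remove the export-to-jar step itself, but the export and the S3 upload can also be automated (for example with the SDK's S3 TransferManager or a build script), which gets close to one-click submission from Eclipse.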