elastic-map-reduce

Loading data with Hive, S3, EMR, and Recover Partitions

Submitted by 一个人想着一个人 on 2019-12-03 06:04:09
SOLVED: See Update #2 below for the 'solution' to this issue. In S3, I have some log*.gz files stored in a nested directory structure like: s3://($BUCKET)/y=2012/m=11/d=09/H=10/ I'm attempting to load these into Hive on Elastic MapReduce (EMR), using a multi-level partition spec like: create external table logs (content string) partitioned by (y string, m string, d string, h string) location 's3://($BUCKET)'; Creation of the table works. I then attempt to recover all of the existing partitions: alter table logs recover partitions; This seems to work and it does drill down through my

The reduce fails due to Task attempt failed to report status for 600 seconds. Killing! Solution?

Submitted by 被刻印的时光 ゝ on 2019-12-03 04:28:17
The reduce phase of the job fails with: # of failed Reduce Tasks exceeded allowed limit. The reason why each task fails is: Task attempt_201301251556_1637_r_000005_0 failed to report status for 600 seconds. Killing! Problem in detail: The Map phase takes in each record, which is of the format: time, rid, data. The data is of the format: data element, and its count. e.g.: a,1 b,4 c,7 corresponds to the data of a record. The mapper outputs, for each data element, the data for every record. e.g.: key:(time, a,), val: (rid,data) key:(time, b,), val: (rid,data) key:(time, c,), val: (rid,data) Every reduce
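One common way to avoid the 600-second timeout in a Java MapReduce job is to report progress from inside the reduce loop so the framework knows the task is still alive. The sketch below is a minimal, hypothetical reducer (the class name and types are assumptions, since the original reducer code is not shown above), assuming the long-running work happens while iterating over the values:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AggregatingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long processed = 0;
        for (Text value : values) {
            // ... expensive per-value work would go here ...
            processed++;
            if (processed % 10000 == 0) {
                // Tell the framework this task is still making progress.
                context.progress();
                context.setStatus("processed " + processed + " values for key " + key);
            }
        }
        context.write(key, new Text(Long.toString(processed)));
    }
}
```

Raising mapred.task.timeout is the other common workaround, but reporting progress keeps genuinely hung tasks detectable.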

How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?

Submitted by 耗尽温柔 on 2019-12-03 03:24:28
I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom jar. We can use the following way to specify these configurations when we run using external scripting languages like Ruby or Python: ruby elastic-mapreduce -j --stream --step-name "mystream" --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 --mapper s3://somepath/mapper.rb --reducer s3://somepath/reducer.rb --input s3://somepath/input --output s3://somepath/output I tried the following ways, but none of them worked:
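For a custom jar, one common approach (an assumption here, since the question does not show the jar's entry point) is to write the driver against the Tool/ToolRunner API, so GenericOptionsParser strips -D key=value arguments and applies them to the job's Configuration. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D overrides parsed by ToolRunner,
        // e.g. -D mapred.task.timeout=0 -D mapred.min.split.size=52880
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "mystream-equivalent");
        job.setJarByClass(MyJob.class);
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
    }
}
```

With such an entry point, the -D overrides can be passed to the jar step as ordinary step arguments from the EMR CLI; the exact flags depend on the CLI version, so treat this as a sketch rather than the definitive invocation.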

Sharing data between master and reduce

Submitted by 依然范特西╮ on 2019-12-02 13:18:41
Question: I need to perform aggregation using the results from all the reduce tasks. Basically, the reduce task finds the sum and count of a value, and I need to add all the sums and counts and find the final average. I tried using conf.setInt in reduce, but when I try to access it from the main function it fails: class Main { public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> { public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
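Changes made to the Configuration inside a task are not visible back in the driver, so conf.setInt cannot work here. A common alternative is to publish each reducer's partial sum and count through Hadoop Counters, which the driver reads after the job finishes. The sketch below is hypothetical (class and counter names are made up) and assumes the values being averaged are integral, since counters are long-valued:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageExample {
    // Counters are aggregated across all tasks and are visible to the driver.
    public enum Stats { SUM, COUNT }

    public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (Text value : values) {
                sum += Long.parseLong(value.toString());  // assumes numeric values
                count++;
            }
            context.getCounter(Stats.SUM).increment(sum);
            context.getCounter(Stats.COUNT).increment(count);
            context.write(key, new IntWritable((int) (count == 0 ? 0 : sum / count)));
        }
    }

    // In the driver, after job.waitForCompletion(true) returns:
    static double globalAverage(Job job) throws IOException {
        long sum = job.getCounters().findCounter(Stats.SUM).getValue();
        long count = job.getCounters().findCounter(Stats.COUNT).getValue();
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```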

hadoop converting \r\n to \n and breaking ARC format

Submitted by 吃可爱长大的小学妹 on 2019-12-01 20:46:01
Question: I am trying to parse data from commoncrawl.org using Hadoop streaming. I set up a local Hadoop to test my code, and have a simple Ruby mapper which uses a streaming ARCfile reader. When I invoke my code myself, like cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb, it works as expected. It seems that Hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to a mapper - however, while doing so it converts \r\n linebreaks in the stream to \n. Since ARC

Get a yarn configuration from commandline

Submitted by 天涯浪子 on 2019-11-30 17:53:05
In EMR, is there a way to get a specific value of the configuration, given the configuration key, using the yarn command? For example, I would like to do something like this: yarn get-config yarn.scheduler.maximum-allocation-mb It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS. > hdfs getconf -confKey fs.defaultFS hdfs://localhost:19000 > hdfs getconf -confKey dfs.namenode.name.dir file:///Users/chris/hadoop-deploy-trunk/data/dfs/name > hdfs getconf -confKey yarn.resourcemanager.address 0.0.0.0
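For completeness, the same lookup can be done programmatically. The short sketch below is an assumption-based example (not part of the original answer) using org.apache.hadoop.yarn.conf.YarnConfiguration, which layers yarn-site.xml and yarn-default.xml on top of the core configuration, so the value printed should match what the daemons themselves see:

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class GetYarnConfig {
    public static void main(String[] args) {
        // Loads core-site.xml plus yarn-default.xml / yarn-site.xml from the classpath.
        YarnConfiguration conf = new YarnConfiguration();
        String key = args.length > 0 ? args[0] : "yarn.scheduler.maximum-allocation-mb";
        System.out.println(key + " = " + conf.get(key));
    }
}
```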

How can correct data types be enforced in Apache Pig?

Submitted by 耗尽温柔 on 2019-11-30 17:52:54
Question: I am having trouble SUMming a bag of values, due to a data type error. When I load a CSV file whose lines look like this: 6 574 false 10.1.72.23 2010-05-16 13:56:19 +0930 fbcdn.net static.ak.fbcdn.net 304 text/css 1 /rsrc.php/zPTJC/hash/50l7x7eg.css http pwong Using the following: logs_base = FOREACH raw_logs GENERATE FLATTEN( EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"') ) as ( account_id: int, bytes: long,

Setting hadoop parameters with boto?

Submitted by 大兔子大兔子 on 2019-11-30 09:37:40
I am trying to enable bad input skipping on my Amazon Elastic MapReduce jobs. I am following the wonderful recipe described here: http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code The link above says that I need to somehow set the following configuration parameters on an EMR job: mapred.skip.mode.enabled=true mapred.skip.map.max.skip.records=1 mapred.skip.attempts.to.start.skipping=2 mapred.map.tasks=1000 mapred.map.max.attempts=10 How do I set these (and other) mapred.XXX parameters on a JobFlow using Boto? After many hours of struggling, reading code, and

Deleting file/folder from Hadoop

Submitted by 时间秒杀一切 on 2019-11-30 08:18:26
I'm running an EMR Activity inside a Data Pipeline analyzing log files, and I get the following error when my Pipeline fails: Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://10.208.42.127:9000/home/hadoop/temp-output-s3copy already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth
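One common workaround (not necessarily what the pipeline's author intended, so treat this as an assumption) is to remove the stale output directory before the job runs, either with hadoop fs -rmr or from the driver. A minimal Java sketch, reusing the HDFS path from the error message above:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Path taken from the FileAlreadyExistsException above.
        Path output = new Path("hdfs://10.208.42.127:9000/home/hadoop/temp-output-s3copy");
        FileSystem fs = output.getFileSystem(conf);
        if (fs.exists(output)) {
            // "true" makes the delete recursive, like "hadoop fs -rmr <dir>".
            fs.delete(output, true);
        }
    }
}
```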