elastic-map-reduce

Loading data with Hive, S3, EMR, and Recover Partitions

Submitted by 一个人想着一个人 on 2019-12-03 06:04:09
SOLVED: See Update #2 below for the 'solution' to this issue. In S3, I have some log*.gz files stored in a nested directory structure like: s3://($BUCKET)/y=2012/m=11/d=09/H=10/ I'm attempting to load these into Hive on Elastic MapReduce (EMR), using a multi-level partition spec like: create external table logs (content string) partitioned by (y string, m string, d string, h string) location 's3://($BUCKET)'; Creation of the table works. I then attempt to recover all of the existing partitions: alter table logs recover partitions; This seems to work and it does drill down through my

The reduce fails due to Task attempt failed to report status for 600 seconds. Killing! Solution?

Submitted by 被刻印的时光 ゝ on 2019-12-03 04:28:17
The reduce phase of the job fails with: # of failed Reduce Tasks exceeded allowed limit. The reason why each task fails is: Task attempt_201301251556_1637_r_000005_0 failed to report status for 600 seconds. Killing! Problem in detail: The Map phase takes in each record, which is of the format: time, rid, data. The data is of the format: data element, and its count. e.g.: a,1 b,4 c,7 corresponds to the data of a record. The mapper outputs, for each data element, the data for every record. e.g.: key:(time, a,), val: (rid,data) key:(time, b,), val: (rid,data) key:(time, c,), val: (rid,data) Every reduce
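One common way to avoid the 600-second timeout in a Java MapReduce job is to report progress from inside the reduce loop so the framework knows the task is still alive. The sketch below is a minimal, hypothetical reducer (the class name and types are assumptions, since the original reducer code is not shown above), assuming the long-running work happens while iterating over the values:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AggregatingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long processed = 0;
        for (Text value : values) {
            // ... expensive per-value work would go here ...
            processed++;
            if (processed % 10000 == 0) {
                // Tell the framework this task is still making progress.
                context.progress();
                context.setStatus("processed " + processed + " values for key " + key);
            }
        }
        context.write(key, new Text(Long.toString(processed)));
    }
}
```

Raising mapred.task.timeout is the other common workaround, but reporting progress keeps genuinely hung tasks detectable.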

How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?

Submitted by 耗尽温柔 on 2019-12-03 03:24:28
I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom jar. We can use the following way to specify these configurations when we run using external scripting languages like Ruby or Python: ruby elastic-mapreduce -j --stream --step-name "mystream" --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 --mapper s3://somepath/mapper.rb --reducer s3://somepath/reducer.rb --input s3://somepath/input --output s3://somepath/output I tried the following ways, but none of them worked:
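For a custom jar, one common approach (an assumption here, since the question does not show the jar's entry point) is to write the driver against the Tool/ToolRunner API, so GenericOptionsParser strips -D key=value arguments and applies them to the job's Configuration. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D overrides parsed by ToolRunner,
        // e.g. -D mapred.task.timeout=0 -D mapred.min.split.size=52880
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "mystream-equivalent");
        job.setJarByClass(MyJob.class);
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
    }
}
```

With such an entry point, the -D overrides can be passed to the jar step as ordinary step arguments from the EMR CLI; the exact flags depend on the CLI version, so treat this as a sketch rather than the definitive invocation.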

Sharing data between master and reduce

Submitted by 依然范特西╮ on 2019-12-02 13:18:41
Question: I need to perform aggregation using the results from all the reduce tasks. Basically, the reduce task finds the sum and count of a value, and I need to add all the sums and counts and find the final average. I tried using conf.setInt in reduce, but when I try to access it from the main function it fails: class Main { public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> { public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
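Changes made to the Configuration inside a task are not visible back in the driver, so conf.setInt cannot work here. A common alternative is to publish each reducer's partial sum and count through Hadoop Counters, which the driver reads after the job finishes. The sketch below is hypothetical (class and counter names are made up) and assumes the values being averaged are integral, since counters are long-valued:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageExample {
    // Counters are aggregated across all tasks and are visible to the driver.
    public enum Stats { SUM, COUNT }

    public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (Text value : values) {
                sum += Long.parseLong(value.toString());  // assumes numeric values
                count++;
            }
            context.getCounter(Stats.SUM).increment(sum);
            context.getCounter(Stats.COUNT).increment(count);
            context.write(key, new IntWritable((int) (count == 0 ? 0 : sum / count)));
        }
    }

    // In the driver, after job.waitForCompletion(true) returns:
    static double globalAverage(Job job) throws IOException {
        long sum = job.getCounters().findCounter(Stats.SUM).getValue();
        long count = job.getCounters().findCounter(Stats.COUNT).getValue();
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```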

hadoop converting \r\n to \n and breaking ARC format

Submitted by 吃可爱长大的小学妹 on 2019-12-01 20:46:01
Question: I am trying to parse data from commoncrawl.org using Hadoop streaming. I set up a local Hadoop to test my code, and have a simple Ruby mapper which uses a streaming ARCfile reader. When I invoke my code myself, like cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb, it works as expected. It seems that Hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to a mapper - however, while doing so it converts \r\n linebreaks in the stream to \n. Since ARC

Get a yarn configuration from commandline

Submitted by 天涯浪子 on 2019-11-30 17:53:05
In EMR, is there a way to get a specific value of the configuration, given the configuration key, using the yarn command? For example, I would like to do something like this: yarn get-config yarn.scheduler.maximum-allocation-mb It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS. > hdfs getconf -confKey fs.defaultFS hdfs://localhost:19000 > hdfs getconf -confKey dfs.namenode.name.dir file:///Users/chris/hadoop-deploy-trunk/data/dfs/name > hdfs getconf -confKey yarn.resourcemanager.address 0.0.0.0
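For completeness, the same lookup can be done programmatically. The short sketch below is an assumption-based example (not part of the original answer) using org.apache.hadoop.yarn.conf.YarnConfiguration, which layers yarn-site.xml and yarn-default.xml on top of the core configuration, so the value printed should match what the daemons themselves see:

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class GetYarnConfig {
    public static void main(String[] args) {
        // Loads core-site.xml plus yarn-default.xml / yarn-site.xml from the classpath.
        YarnConfiguration conf = new YarnConfiguration();
        String key = args.length > 0 ? args[0] : "yarn.scheduler.maximum-allocation-mb";
        System.out.println(key + " = " + conf.get(key));
    }
}
```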

How can correct data types be enforced in Apache Pig?

Submitted by 耗尽温柔 on 2019-11-30 17:52:54
Question: I am having trouble SUMming a bag of values, due to a data type error. When I load a CSV file whose lines look like this: 6 574 false 10.1.72.23 2010-05-16 13:56:19 +0930 fbcdn.net static.ak.fbcdn.net 304 text/css 1 /rsrc.php/zPTJC/hash/50l7x7eg.css http pwong Using the following: logs_base = FOREACH raw_logs GENERATE FLATTEN( EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"') ) as ( account_id: int, bytes: long,

Setting hadoop parameters with boto?

Submitted by 大兔子大兔子 on 2019-11-30 09:37:40
I am trying to enable bad input skipping on my Amazon Elastic MapReduce jobs. I am following the wonderful recipe described here: http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code The link above says that I need to somehow set the following configuration parameters on an EMR job: mapred.skip.mode.enabled=true mapred.skip.map.max.skip.records=1 mapred.skip.attempts.to.start.skipping=2 mapred.map.tasks=1000 mapred.map.max.attempts=10 How do I set these (and other) mapred.XXX parameters on a JobFlow using Boto? After many hours of struggling, reading code, and

Deleting file/folder from Hadoop

Submitted by 时间秒杀一切 on 2019-11-30 08:18:26
I'm running an EMR Activity inside a Data Pipeline analyzing log files, and I get the following error when my Pipeline fails: Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://10.208.42.127:9000/home/hadoop/temp-output-s3copy already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth
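One common workaround (not necessarily what the pipeline's author intended, so treat this as an assumption) is to remove the stale output directory before the job runs, either with hadoop fs -rmr or from the driver. A minimal Java sketch, reusing the HDFS path from the error message above:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Path taken from the FileAlreadyExistsException above.
        Path output = new Path("hdfs://10.208.42.127:9000/home/hadoop/temp-output-s3copy");
        FileSystem fs = output.getFileSystem(conf);
        if (fs.exists(output)) {
            // "true" makes the delete recursive, like "hadoop fs -rmr <dir>".
            fs.delete(output, true);
        }
    }
}
```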