elastic-map-reduce

Error: java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class Myclass

跟風遠走 submitted on 2020-01-01 06:38:26
Question: I have my mapper and reducer as follows, but I am getting a strange exception and can't figure out why it is being thrown.

    public static class MyMapper implements Mapper<LongWritable, Text, Text, Info> {
        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, Info> output, Reporter reporter) throws IOException {
            Text text = new Text("someText");
            // process
            output.collect(text, infoObject);
        }
    }

    public static class MyReducer implements Reducer…
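The usual cause of this "wrong value class" error is a mismatch between the value class the reducer actually emits and the one configured on the job. A minimal sketch of the relevant driver settings in the old JobConf API (Driver is hypothetical; Info and the chosen classes follow the question — this is not the original poster's code):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class Driver {
        public static void main(String[] args) {
            JobConf conf = new JobConf(Driver.class);
            // Map output classes must match what MyMapper collects: (Text, Info).
            conf.setMapOutputKeyClass(Text.class);
            conf.setMapOutputValueClass(Info.class);
            // Job output classes must match what MyReducer collects; if these are
            // left at defaults (or set to some Myclass) while the reducer collects
            // a Text value, exactly this IOException is thrown.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
        }
    }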

Exporting a Hive table to an S3 bucket

故事扮演 submitted on 2019-12-30 00:52:08
Question: I've created a Hive table through an Elastic MapReduce interactive session and populated it from a CSV file like this:

    CREATE TABLE csvimport(id BIGINT, time STRING, log STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    LOAD DATA LOCAL INPATH '/home/hadoop/file.csv' OVERWRITE INTO TABLE csvimport;

I now want to store the Hive table in an S3 bucket so the table is preserved once I terminate the MapReduce instance. Does anyone know how to do this?

Answer 1: Yes, you have to export and import…
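A minimal sketch of the usual approach (the bucket name and path are hypothetical): define an external table whose LOCATION is on S3, then copy the rows into it so the data outlives the cluster.

    CREATE EXTERNAL TABLE csvexport(id BIGINT, time STRING, log STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://my-bucket/hive/csvexport/';

    INSERT OVERWRITE TABLE csvexport
    SELECT id, time, log FROM csvimport;

Because the table is EXTERNAL, dropping it later removes only the metadata; the files stay in the bucket.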

How to use external data with Elastic MapReduce

旧城冷巷雨未停 submitted on 2019-12-24 08:47:18
Question: From Amazon's EMR FAQ:

    Q: Can I load my data from the internet or somewhere other than Amazon S3?
    Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet, EC2 bandwidth charges will apply. Amazon Elastic MapReduce also provides Hive-based access to data in DynamoDB.

What are the specifications for loading data from external (non-S3) sources? There seems to be a dearth of resources around this option…
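As a sketch of one option (the URL and HDFS path are hypothetical, and this assumes it runs on the cluster so the default FileSystem is HDFS): a driver or setup step can pull an external file into HDFS with Hadoop's FileSystem API, after which the job reads it like any other input.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FetchExternalInput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Copy an internet-hosted file into HDFS so the job can read it.
            try (InputStream in = new URL("http://example.com/data.csv").openStream();
                 OutputStream out = fs.create(new Path("/input/data.csv"))) {
                IOUtils.copyBytes(in, out, 4096, false);
            }
        }
    }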

Amazon EMR: Configuring storage on data nodes

别说谁变了你拦得住时间么 submitted on 2019-12-22 11:05:20
Question: I'm using Amazon EMR and am able to run most jobs fine. I run into a problem when I start loading and generating more data within the EMR cluster: the cluster runs out of storage space. Each data node is a c1.medium instance. According to the links here and here, each data node should come with 350GB of instance storage. Through the ElasticMapReduce Slave security group I've been able to verify in my AWS Console that the c1.medium data nodes are running and are instance stores. When I…
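Two quick checks for how much of that instance storage HDFS actually sees (standard commands of that Hadoop era; /mnt is where EMR typically mounts instance-store volumes, but verify on your AMI):

    hadoop dfsadmin -report   # configured vs. remaining HDFS capacity, per datanode
    df -h /mnt                # the instance-store volume backing the HDFS data dirs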

Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?

主宰稳场 submitted on 2019-12-19 10:55:12
Question: I'm having an issue where the output of my Hadoop job on AWS's EMR is not being saved to S3. When I run the job on a smaller sample, it stores the output just fine. When I run the same command on my full dataset, the job completes again, but nothing exists on S3 where I specified my output to go. Apparently there was a bug with AWS EMR in 2009, but it was "fixed". Has anyone else ever had this problem? I still have my cluster online, hoping that the data is buried on the servers somewhere.
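While the cluster is still up, it is worth checking whether the output was written to the cluster's own HDFS or left in an uncommitted temporary directory instead of S3 (the bucket and paths below are hypothetical):

    hadoop fs -ls s3n://my-bucket/my-output/       # what actually landed on S3
    hadoop fs -ls hdfs:///my-output/               # output accidentally written to cluster HDFS
    hadoop fs -ls hdfs:///my-output/_temporary/    # task-attempt output that was never committed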

Is it possible to jar up an executable so that it can be run from Java?

你。 submitted on 2019-12-19 09:25:21
Question: Simply put, I need to be able to stick a compiled executable inside a Java jar file and then run it from Java (probably via ProcessBuilder). The why: I would like to use a Java wrapper around the ImageMagick executable as a component of an image-processing Elastic MapReduce job. EMR only expects to take a jar file, so I don't think there's any room to install software on the data nodes that spin up.

Answer 1: The executable inside the jar is a resource; you may access it via a…
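A minimal sketch of that approach (the resource path /bin/convert and the ImageMagick arguments are hypothetical): extract the bundled binary to a temp file, mark it executable, and launch it with ProcessBuilder.

    import java.io.File;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.StandardCopyOption;

    public class BundledExecutable {
        public static void main(String[] args) throws Exception {
            // Copy the executable packaged as a classpath resource out of the jar.
            File tmp = File.createTempFile("convert", null);
            tmp.deleteOnExit();
            try (InputStream in = BundledExecutable.class.getResourceAsStream("/bin/convert")) {
                Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            }
            tmp.setExecutable(true);

            // Run it like any other external process.
            Process p = new ProcessBuilder(tmp.getAbsolutePath(), "input.png", "-resize", "50%", "output.png")
                    .inheritIO()
                    .start();
            System.exit(p.waitFor());
        }
    }

Note that the binary must be compiled for the node's architecture, since whatever is in the jar is run as-is.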

Too many open files in EMR

蹲街弑〆低调 submitted on 2019-12-18 06:57:03
Question: I am getting the following exception in my reducers:

    EMFILE: Too many open files
        at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
        at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:296)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:369)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:257)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth…
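EMFILE means the process hit its open-file-descriptor limit. One common remedy is raising the limit on every node, for example through an EMR bootstrap action (a sketch; the hadoop user name and the 65536 value are assumptions to adjust for your setup):

    #!/bin/bash
    # Bootstrap action: raise the open-file limit before the daemons start.
    echo "hadoop soft nofile 65536" | sudo tee -a /etc/security/limits.conf
    echo "hadoop hard nofile 65536" | sudo tee -a /etc/security/limits.conf

It is also worth auditing the reducer for streams opened per record but never closed, which will exhaust any limit eventually.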

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

风格不统一 submitted on 2019-12-18 06:20:11
Question: I would like to upgrade my AWS Data Pipeline definition to EMR 4.x or 5.x so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP. The change from EMR 3.x to 4.x/5.x requires the use of releaseLabel in EmrCluster, versus amiVersion. When I use "releaseLabel": "emr-4.1.0", I get the following error:

    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

Below is my data pipeline definition, for EMR…
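For orientation, a hedged sketch of what an EmrCluster object looks like with releaseLabel (the id, instance types, and counts are placeholders; on release-label clusters, applications such as Hive must be requested explicitly via the applications field):

    {
      "id": "EmrClusterForHive",
      "type": "EmrCluster",
      "releaseLabel": "emr-4.1.0",
      "applications": ["Hive"],
      "masterInstanceType": "m3.xlarge",
      "coreInstanceType": "m3.xlarge",
      "coreInstanceCount": "2",
      "terminateAfter": "2 Hours"
    }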

How to force Hadoop to unzip inputs regardless of their extension?

独自空忆成欢 submitted on 2019-12-14 02:02:18
Question: I'm running MapReduce and my inputs are gzipped but do not have a .gz file-name extension. Normally, when they do have the .gz extension, Hadoop takes care of unzipping them on the fly before passing them to the mapper; without the extension it doesn't. I can't rename my files, so I need some way of "forcing" Hadoop to unzip them even though they do not have the .gz extension. I tried passing the following flags to Hadoop:

    step_args=[
        "-jobconf", "stream.recordreader…
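One workaround often suggested for this (a sketch, not verified against every Hadoop version): subclass GzipCodec so it claims whatever extension the files actually carry — ".log" below is purely a hypothetical example — and register the subclass so the input format decompresses those files.

    import org.apache.hadoop.io.compress.GzipCodec;

    // A gzip codec that matches files ending in ".log" (hypothetical) instead of
    // ".gz", so CompressionCodecFactory hands them to the gzip decompressor.
    public class ForcedGzipCodec extends GzipCodec {
        @Override
        public String getDefaultExtension() {
            return ".log";
        }
    }

It would then be enabled with something like -jobconf io.compression.codecs=com.example.ForcedGzipCodec (class name hypothetical). Files with no extension at all are harder, since codec selection is purely suffix-based.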