elastic-map-reduce

Exporting Hive Table to an S3 bucket

Submitted by 孤街浪徒 on 2019-11-30 04:46:32
I've created a Hive table through an Elastic MapReduce interactive session and populated it from a CSV file like this:

CREATE TABLE csvimport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/hadoop/file.csv' OVERWRITE INTO TABLE csvimport;

I now want to store the Hive table in an S3 bucket so the table is preserved once I terminate the MapReduce instance. Does anyone know how to do this?

Answer (user495732 / Why Me): Yes, you have to export and import your data at the start and end of your Hive session. To do this you need to create a table
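A minimal sketch of the usual pattern, run from the master node: create an EXTERNAL table whose LOCATION is an S3 path, then copy the rows into it so the data survives cluster termination. The bucket path and export table name below are hypothetical.

import subprocess

# Hypothetical bucket/table names. An EXTERNAL table whose LOCATION points at
# S3 keeps its data after the EMR cluster is terminated.
export_hql = """
CREATE EXTERNAL TABLE IF NOT EXISTS csvexport (id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://my-bucket/hive/csvexport/';

INSERT OVERWRITE TABLE csvexport SELECT * FROM csvimport;
"""

# Run the statements with the Hive CLI on the master node.
subprocess.run(["hive", "-e", export_hql], check=True)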

Get a YARN configuration from the command line

Submitted by ♀尐吖头ヾ on 2019-11-30 02:45:56
Question: In EMR, is there a way to get a specific value of the configuration, given the configuration key, using the yarn command? For example, I would like to do something like this:

yarn get-config yarn.scheduler.maximum-allocation-mb

Answer 1: It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS.

> hdfs getconf -confKey fs.defaultFS
hdfs://localhost:19000
> hdfs getconf -confKey dfs.namenode.name.dir
file://
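A small wrapper around that command, assuming the hdfs client is on the PATH; handy when a script needs the value rather than a human reading it.

import subprocess

def get_conf(key):
    # `hdfs getconf -confKey <key>` resolves YARN and MapReduce properties too,
    # not only HDFS ones, as noted in the answer above.
    result = subprocess.run(["hdfs", "getconf", "-confKey", key],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(get_conf("yarn.scheduler.maximum-allocation-mb"))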

Scheduling A Job on AWS EC2

Submitted by 笑着哭i on 2019-11-30 00:30:34
I have a website running on AWS EC2. I need to create a nightly job that generates a sitemap file and uploads the files to the various browsers. I'm looking for a utility on AWS that allows this functionality. I've considered the following:

1) Generate a request to the web server that triggers it to do this task. I don't like this approach because it ties up a server thread and uses CPU cycles on the host.

2) Create a cron job on the machine the web server is running on to execute this task. Again, I don't like this approach because it takes CPU cycles away from the web server.

3) Create another
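For illustration, a sketch of what the standalone nightly script could look like with boto; the bucket name and sitemap generator are hypothetical, and the script would be triggered by cron or a separate scheduler instance rather than by the web server itself.

#!/usr/bin/env python
# Hypothetical nightly job, e.g. crontab entry:
#   0 3 * * * python /opt/jobs/nightly_sitemap.py
import boto
from boto.s3.key import Key

def build_sitemap():
    # Placeholder; a real generator would walk the site's URLs.
    return "<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>"

conn = boto.connect_s3()
bucket = conn.get_bucket("my-site-assets")          # hypothetical bucket name
key = Key(bucket, "sitemap.xml")
key.set_contents_from_string(build_sitemap(),
                             headers={"Content-Type": "application/xml"})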

Setting hadoop parameters with boto?

Submitted by 五迷三道 on 2019-11-29 14:44:30
Question: I am trying to enable bad-input skipping on my Amazon Elastic MapReduce jobs. I am following the wonderful recipe described here: http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code

The link above says that I need to somehow set the following configuration parameters on an EMR job:

mapred.skip.mode.enabled=true
mapred.skip.map.max.skip.records=1
mapred.skip.attempts.to.start.skipping=2
mapred.map.tasks=1000
mapred.map.max.attempts=10

How do I set these (and other)
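One way I understand this can be done with boto 2.x is through Amazon's stock configure-hadoop bootstrap action, which accepts -m key=value pairs for mapred-site. The module paths, script locations, and mapper/reducer names below are assumptions, so treat this as a sketch rather than a confirmed recipe.

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

# Push the skip-mode settings into mapred-site.xml on every node.
configure_hadoop = BootstrapAction(
    "Enable skip mode",
    "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
    ["-m", "mapred.skip.mode.enabled=true",
     "-m", "mapred.skip.map.max.skip.records=1",
     "-m", "mapred.skip.attempts.to.start.skipping=2",
     "-m", "mapred.map.max.attempts=10"])

step = StreamingStep(name="streaming step",
                     mapper="s3://my-bucket/mapper.py",     # hypothetical paths
                     reducer="s3://my-bucket/reducer.py",
                     input="s3://my-bucket/input/",
                     output="s3://my-bucket/output/")

jobid = conn.run_jobflow(name="skip-mode job",
                         log_uri="s3://my-bucket/logs/",
                         bootstrap_actions=[configure_hadoop],
                         steps=[step])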

Deleting file/folder from Hadoop

Submitted by 早过忘川 on 2019-11-29 12:00:49
Question: I'm running an EMR Activity inside a Data Pipeline analyzing log files, and I get the following error when my Pipeline fails:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://10.208.42.127:9000/home/hadoop/temp-output-s3copy already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient$2.run
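The usual remedy is to delete the stale output directory before the job runs, since FileOutputFormat refuses to overwrite an existing one. A hedged sketch of such a pre-step, using the path from the error above:

import subprocess

output_dir = "hdfs:///home/hadoop/temp-output-s3copy"   # directory named in the error

# `hadoop fs -rm -r` on Hadoop 2.x (`-rmr` on older releases); check=False so a
# missing directory does not abort the pre-step.
subprocess.run(["hadoop", "fs", "-rm", "-r", "-skipTrash", output_dir], check=False)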

Too many open files in EMR

Submitted by 柔情痞子 on 2019-11-29 11:23:05
I am getting the following exception in my reducers:

EMFILE: Too many open files
at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:296)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:257)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs
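EMFILE means the process hit its per-process open-file limit (ulimit -n); the lasting fix is to raise nofile on the task nodes (for example in /etc/security/limits.conf via a bootstrap action) and to make sure the reducer closes files promptly. Purely as an illustration, and only if the reducer were a Hadoop Streaming script written in Python, the soft limit could be raised at startup when the hard limit allows it:

import resource

# Illustrative only: raise the soft open-file limit toward the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))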

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

Submitted by 无人久伴 on 2019-11-29 11:08:32
I would like to upgrade my AWS Data Pipeline definition to EMR 4.x or 5.x so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP, etc. The change from EMR 3.x to 4.x/5.x requires the use of releaseLabel in EmrCluster, instead of amiVersion. When I use "releaseLabel": "emr-4.1.0", I get the following error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

Below is my data pipeline definition for EMR 3.x. It works well, so I hope others find this useful (including the answer for EMR 4.x/5.x), as the
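For reference, a sketch (written as a Python dict, with hypothetical ids and instance types, and field names as I understand the EmrCluster object) of how the cluster object changes under a release label: releaseLabel replaces amiVersion, and the applications to install are listed explicitly.

# Hypothetical EmrCluster fragment of a Data Pipeline definition for EMR 4.x/5.x.
emr_cluster = {
    "id": "EmrClusterForHive",
    "name": "EmrClusterForHive",
    "type": "EmrCluster",
    "releaseLabel": "emr-5.8.0",        # replaces "amiVersion"
    "applications": ["hive", "pig", "spark"],
    "masterInstanceType": "m3.xlarge",
    "coreInstanceType": "m3.xlarge",
    "coreInstanceCount": "2",
    "terminateAfter": "6 Hours",
}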

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

Submitted by 老子叫甜甜 on 2019-11-28 17:38:46
I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly allocates all the worker nodes to the Spark job (with one for the driver, of course). I have the magic "maximizeResourceAllocation" property set to "true", and the Spark property "spark.dynamicAllocation.enabled" also set to "true". However, if I resize the EMR cluster by adding nodes to the CORE pool of worker machines, YARN only adds some of the new nodes to the Spark job. For example, this
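One commonly suggested workaround, sketched below with a hypothetical application jar, is to rely on dynamic allocation with an explicit executor range rather than on the executor settings that maximizeResourceAllocation, as I understand it, derives from the cluster's size at creation time; the property names are standard Spark settings.

import subprocess

# Hypothetical spark-submit invocation. The external shuffle service must be
# running on the node managers for dynamic allocation to release and re-acquire
# executors as the cluster grows.
subprocess.run([
    "spark-submit",
    "--master", "yarn-cluster",
    "--conf", "spark.dynamicAllocation.enabled=true",
    "--conf", "spark.shuffle.service.enabled=true",
    "--conf", "spark.dynamicAllocation.minExecutors=1",
    "--conf", "spark.dynamicAllocation.maxExecutors=1000",
    "s3://my-bucket/my-spark-app.jar",                      # hypothetical jar
], check=True)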

How do you use Python UDFs with Pig in Elastic MapReduce?

Submitted by 懵懂的女人 on 2019-11-28 12:37:18
I really want to take advantage of Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't quite get things to work properly. No matter what I try, my Pig job fails with the following exception being logged:

ERROR 2998: Unhandled internal error. org/python/core/PyException
java.lang.NoClassDefFoundError: org/python/core/PyException
at org.apache.pig.scripting.jython.JythonScriptEngine.registerFunctions(JythonScriptEngine.java:127)
at org.apache.pig.PigServer.registerCode(PigServer.java:568)
at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421)
at org
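That NoClassDefFoundError usually means the Jython jar is not on Pig's classpath when the script is registered. For reference, a minimal UDF sketch (hypothetical file and function names) that works once Jython is available:

# my_udfs.py -- registered from Pig Latin with something like:
#   REGISTER 'my_udfs.py' USING jython AS my_udfs;
#   upper_words = FOREACH words GENERATE my_udfs.to_upper(word);
try:
    outputSchema            # injected by Pig's Jython script engine at registration
except NameError:
    def outputSchema(schema):          # no-op fallback so the file also runs standalone
        def wrap(fn):
            return fn
        return wrap

@outputSchema("word:chararray")
def to_upper(s):
    return s.upper() if s is not None else None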

How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce

Submitted by 二次信任 on 2019-11-28 03:30:46
Question: According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is:

min(yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb,
    yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores)

However, on setting these parameters to (for a cluster of c3.2xlarges):

yarn.nodemanager.resource.memory-mb = 14336
mapreduce.map.memory.mb = 2048
yarn
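Plugging the question's numbers into that formula (the vcore values are assumptions, since the excerpt is cut off; a c3.2xlarge exposes 8 vCPUs and mapreduce.map.cpu.vcores defaults to 1):

node_memory_mb = 14336    # yarn.nodemanager.resource.memory-mb
map_memory_mb = 2048      # mapreduce.map.memory.mb
node_vcores = 8           # yarn.nodemanager.resource.cpu-vcores (assumed)
map_vcores = 1            # mapreduce.map.cpu.vcores (assumed default)

concurrent_map_tasks = min(node_memory_mb // map_memory_mb,
                           node_vcores // map_vcores)
print(concurrent_map_tasks)   # 7 -- memory-bound, not vcore-bound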