amazon-emr

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

Submitted by 自古美人都是妖i on 2019-12-06 05:48:30
Question: I am running an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and trying to use the AWS Glue Data Catalog as its metastore. As per the guidelines in the official AWS documentation (reference link below), I have followed the steps, but I am seeing a discrepancy when accessing the Glue Catalog databases/tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted. AWS documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark
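For reference, the documentation linked above enables the Glue Data Catalog by attaching a spark-hive-site classification when the cluster is created. A minimal boto3 sketch follows; the region, instance settings, and roles are placeholder assumptions, not values from the question:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    # classification documented by AWS for using Glue as the Spark SQL metastore
    glue_metastore_config = [{
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        },
    }]

    emr.run_job_flow(
        Name="spark-with-glue-catalog",
        ReleaseLabel="emr-5.11.1",
        Applications=[{"Name": "Spark"}],
        Configurations=glue_metastore_config,
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )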

Flink TaskManagers do not start until job is submitted in YARN cluster

Submitted by 筅森魡賤 on 2019-12-06 05:32:50
I am using Amazon EMR to run a Flink cluster on YARN. My setup consists of m4.large instances: one master and two core nodes. I started the Flink cluster on YARN with the command flink-yarn-session -n 2 -d -tm 4096 -s 4. The Flink Job Manager and Application Manager start, but there are no Task Managers running. The Flink web interface shows 0 for task managers, task slots, and available slots. However, when I submit a job to the Flink cluster, Task Managers get allocated, the job runs, and the web UI shows the expected values, then goes back to 0 once the job is complete. I would like
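As an aside, the same YARN session can also be started as an EMR step rather than from a shell on the master node; a rough boto3 sketch mirroring the flink-yarn-session arguments above (region and cluster id are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
        Steps=[{
            "Name": "start Flink YARN session",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # same arguments as the command in the question
                "Args": ["bash", "-c", "flink-yarn-session -n 2 -d -tm 4096 -s 4"],
            },
        }],
    )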

YARN log aggregation on AWS EMR - UnsupportedFileSystemException

Submitted by 守給你的承諾、 on 2019-12-06 03:59:50
Question: I am struggling to enable YARN log aggregation for my Amazon EMR cluster. I am following this documentation for the configuration: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive under the section titled "To aggregate logs in Amazon S3 using the AWS CLI". I've verified that the hadoop-config bootstrap action puts the following in yarn-site.xml: <property><name>yarn.log-aggregation-enable</name><value>true</value><
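On release-label (emr-4.x+) clusters, the same yarn-site properties can also be supplied through the configuration API instead of the legacy hadoop-config bootstrap action. A minimal sketch, assuming aggregation into an HDFS directory rather than directly into S3 (the directory below is an assumption, not a value from the question):

    # pass this list as Configurations=[...] to boto3's run_job_flow,
    # or via `aws emr create-cluster --configurations`
    yarn_log_aggregation_config = [{
        "Classification": "yarn-site",
        "Properties": {
            "yarn.log-aggregation-enable": "true",
            "yarn.log-aggregation.retain-seconds": "-1",
            "yarn.nodemanager.remote-app-log-dir": "hdfs:///var/log/hadoop-yarn/apps",
        },
    }]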

Running MapReduce jobs on AWS-EMR from Eclipse

Submitted by 瘦欲@ on 2019-12-05 22:17:22
I have the WordCount MapReduce example in Eclipse. I exported it to a JAR, copied it to S3, and then ran it successfully on AWS-EMR. Then I read this article: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-common-programming-sample.html It shows how to use the AWS-EMR API to run MapReduce jobs, but it still assumes your MapReduce code is packaged in a JAR. I would like to know whether there is a way to run MapReduce code from Eclipse directly on AWS-EMR, without having to export it to a JAR. I haven't found a way to do this (for MapReduce jobs written in Java). I believe there is
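The API flow in the linked article submits the packaged JAR as a step; the rough Python equivalent with boto3 is sketched below (bucket names, main class, and cluster id are placeholders, not values from the question):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
        Steps=[{
            "Name": "WordCount from exported JAR",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/wordcount.jar",  # JAR exported from Eclipse
                "MainClass": "com.example.WordCount",   # placeholder main class
                "Args": ["s3://my-bucket/input", "s3://my-bucket/output"],
            },
        }],
    )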

How to set livy.server.session.timeout on EMR cluster bootstrap?

Submitted by 这一生的挚爱 on 2019-12-05 19:02:12
I am creating an EMR cluster and using a Jupyter notebook to run some Spark tasks. My tasks die after approximately one hour of execution, and the error is: An error was encountered: Invalid status code '400' from https://xxx.xx.x.xxx:18888/sessions/0/statements/20 with error payload: "requirement failed: Session isn't active." My understanding is that this is related to the Livy config livy.server.session.timeout, but I don't know how I can set it in the bootstrap of the cluster (I need to do it at bootstrap because the cluster is created with no SSH access). Thanks a lot in advance. On EMR,
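The Livy property can be supplied without SSH by attaching a livy-conf classification when the cluster is created; a short sketch (the five-hour value is an arbitrary example):

    # pass this list as Configurations=[...] to boto3's run_job_flow,
    # or via `aws emr create-cluster --configurations`
    livy_timeout_config = [{
        "Classification": "livy-conf",
        "Properties": {
            "livy.server.session.timeout": "5h",  # arbitrary example value
        },
    }]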

Simple RDD write to DynamoDB in Spark

Submitted by 。_饼干妹妹 on 2019-12-05 12:43:23
I just got stuck trying to import a basic RDD dataset into DynamoDB. This is the code:

import org.apache.hadoop.mapred.JobConf
var rdd = sc.parallelize(Array(("", Map("col1" -> Map("s" -> "abc"), "col2" -> Map("n" -> "123")))))
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.output.tableName", "table_x")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
rdd.saveAsHadoopDataset(jobConf)

And this is the error I get: 16/02/28 15:40:38 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 18, ip-172-31-9-224.eu-west-1.compute
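As an alternative to the DynamoDB Hadoop output format used above, small RDDs can also be written with plain boto3 from each partition. A PySpark sketch, assuming a pyspark shell where sc exists and that table_x in eu-west-1 has a key schema compatible with these items:

    import boto3

    def write_partition(rows):
        # one client per partition; table name and region mirror the question,
        # the item layout is an assumption
        table = boto3.resource("dynamodb", region_name="eu-west-1").Table("table_x")
        with table.batch_writer() as batch:
            for item in rows:
                batch.put_item(Item=item)

    # plain Python dicts instead of the Hadoop-connector attribute maps
    rdd = sc.parallelize([{"col1": "abc", "col2": 123}])
    rdd.foreachPartition(write_partition)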

Hadoop Non-splittable TextInputFormat

Submitted by 百般思念 on 2019-12-05 10:33:32
Question: Is there a way to have a whole file sent to a mapper without being split? I have read this, but I am wondering if there is another way of doing the same thing without having to generate an intermediate file. Ideally, I would like an existing Hadoop command-line option. I am using the streaming facility with Python scripts on Amazon EMR. Answer 1: Just set the configuration property mapred.min.split.size to something huge (10G): -D mapred.min.split.size=10737418240 Or compress the input
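On EMR the same -D flag can be passed to a streaming step; a hedged boto3 sketch (paths, mapper script, and cluster id are placeholders; the generic -D and -files options must come before the streaming options):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
        Steps=[{
            "Name": "streaming step with non-splittable input",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-D", "mapred.min.split.size=10737418240",  # 10G, from the answer above
                    "-files", "s3://my-bucket/mapper.py",       # placeholder mapper script
                    "-mapper", "mapper.py",
                    "-reducer", "aggregate",
                    "-input", "s3://my-bucket/input",
                    "-output", "s3://my-bucket/output",
                ],
            },
        }],
    )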

s3fs on Amazon EMR: Will it scale for approx 100million small files?

Submitted by 荒凉一梦 on 2019-12-05 10:11:43
Please refer to the following questions already asked: "Write 100 million files to s3" and "Too many open files in EMR". The size of the data being handled here is at least around 4-5 TB; to be precise, 300 GB with gzip compression. The input size will grow gradually, since this step aggregates the data over time. For example, the logs up to December 2012 will contain:

UDID-1, DateTime, Lat, Lng, Location
UDID-2, DateTime, Lat, Lng, Location
UDID-3, DateTime, Lat, Lng, Location
UDID-1, DateTime, Lat, Lng, Location

For this we would have to generate separate files with UDID (unique device identifier) as
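For context only, an alternative to mounting the bucket with s3fs is to write a per-UDID layout directly through EMRFS with a partitioned output. A rough PySpark sketch on a current EMR release; the paths and schema are assumptions based on the sample lines above, and with this many distinct UDIDs the result would still be a very large number of small objects:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("per-udid-output").getOrCreate()

    # assumed input location and column layout matching the sample lines
    logs = spark.read.csv(
        "s3://my-log-bucket/2012/*.gz",
        schema="udid STRING, datetime STRING, lat DOUBLE, lng DOUBLE, location STRING",
    )

    # one output directory per device id, written directly to S3 via EMRFS
    logs.write.partitionBy("udid").mode("append").csv("s3://my-output-bucket/by-udid/")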

How to install custom packages via an Amazon EMR bootstrap action in code?

Submitted by 做~自己de王妃 on 2019-12-05 08:33:55
I need to install some packages and binaries via an Amazon EMR bootstrap action, but I can't find any example that uses this. Basically, I want to install a Python package and have each Hadoop node use this package for processing the items in an S3 bucket. Here's a sample from boto:

name='Image to grayscale using SimpleCV python package',
mapper='s3n://elasticmapreduce/samples/imageGrayScale.py',
reducer='aggregate',
input='s3n://elasticmapreduce/samples/input',
output='s3n://<my output bucket>/output'

I need to make it use the SimpleCV Python package, but I'm not sure where to specify this. What
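A bootstrap action is just a script in S3 that every node runs before Hadoop starts. A hedged sketch using boto3 rather than the legacy boto shown above; the script path, release label, and instance settings are placeholders, and the script itself would contain something like sudo pip install SimpleCV:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.run_job_flow(
        Name="Image to grayscale using SimpleCV python package",
        ReleaseLabel="emr-5.11.1",  # placeholder release label
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        BootstrapActions=[{
            "Name": "install SimpleCV on every node",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install-simplecv.sh",  # placeholder script
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )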

How does MapReduce read from multiple input files?

Submitted by 旧时模样 on 2019-12-05 06:47:41
Question: I am developing code to read data and write it into HDFS using MapReduce. However, when I have multiple files, I don't understand how they are processed. The input path to the mapper is the name of the directory, as evident from the output of String filename = conf1.get("map.input.file"); So how does it process the files in the directory? Answer 1: In order to get the input file path you can use the context object, like this: FileSplit fileSplit = (FileSplit) context.getInputSplit(); String
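For anyone using Python streaming instead of the Java API, the analog of that FileSplit lookup is to read the job configuration values that Hadoop streaming exports as environment variables (dots become underscores). A small sketch, assuming a Hadoop version that exposes mapreduce_map_input_file or the older map_input_file name:

    #!/usr/bin/env python
    import os
    import sys

    # current input file for this mapper task, as exported by Hadoop streaming
    input_file = os.environ.get(
        "mapreduce_map_input_file",
        os.environ.get("map_input_file", "unknown"),
    )

    for line in sys.stdin:
        # emit the source file alongside each record (tab-separated)
        sys.stdout.write("%s\t%s" % (input_file, line))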