amazon-emr

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

Submitted by 自古美人都是妖i on 2019-12-06 05:48:30
Question: I am running an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and trying to use the AWS Glue Data Catalog as its metastore. As per the guidelines in the official AWS documentation (reference link below), I have followed the steps, but I am seeing a discrepancy when accessing the Glue Catalog databases/tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted. AWS documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark
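For reference, the documentation linked above enables the Glue Data Catalog by attaching a spark-hive-site classification when the cluster is created. A minimal boto3 sketch follows; the region, instance settings, and roles are placeholder assumptions, not values from the question:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    # classification documented by AWS for using Glue as the Spark SQL metastore
    glue_metastore_config = [{
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        },
    }]

    emr.run_job_flow(
        Name="spark-with-glue-catalog",
        ReleaseLabel="emr-5.11.1",
        Applications=[{"Name": "Spark"}],
        Configurations=glue_metastore_config,
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )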

Flink TaskManagers do not start until job is submitted in YARN cluster

Submitted by 筅森魡賤 on 2019-12-06 05:32:50
I am using Amazon EMR to run a Flink cluster on YARN. My setup consists of m4.large instances: one master and two core nodes. I started the Flink cluster on YARN with the command flink-yarn-session -n 2 -d -tm 4096 -s 4. The Flink Job Manager and Application Manager start, but there are no Task Managers running. The Flink web interface shows 0 for task managers, task slots, and available slots. However, when I submit a job to the Flink cluster, Task Managers get allocated, the job runs, and the web UI shows the expected values, then goes back to 0 once the job is complete. I would like
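As an aside, the same YARN session can also be started as an EMR step rather than from a shell on the master node; a rough boto3 sketch mirroring the flink-yarn-session arguments above (region and cluster id are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
        Steps=[{
            "Name": "start Flink YARN session",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # same arguments as the command in the question
                "Args": ["bash", "-c", "flink-yarn-session -n 2 -d -tm 4096 -s 4"],
            },
        }],
    )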

YARN log aggregation on AWS EMR - UnsupportedFileSystemException

Submitted by 守給你的承諾、 on 2019-12-06 03:59:50
Question: I am struggling to enable YARN log aggregation for my Amazon EMR cluster. I am following this documentation for the configuration: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive under the section titled "To aggregate logs in Amazon S3 using the AWS CLI". I've verified that the hadoop-config bootstrap action puts the following in yarn-site.xml: <property><name>yarn.log-aggregation-enable</name><value>true</value><
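On release-label (emr-4.x+) clusters, the same yarn-site properties can also be supplied through the configuration API instead of the legacy hadoop-config bootstrap action. A minimal sketch, assuming aggregation into an HDFS directory rather than directly into S3 (the directory below is an assumption, not a value from the question):

    # pass this list as Configurations=[...] to boto3's run_job_flow,
    # or via `aws emr create-cluster --configurations`
    yarn_log_aggregation_config = [{
        "Classification": "yarn-site",
        "Properties": {
            "yarn.log-aggregation-enable": "true",
            "yarn.log-aggregation.retain-seconds": "-1",
            "yarn.nodemanager.remote-app-log-dir": "hdfs:///var/log/hadoop-yarn/apps",
        },
    }]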

Running MapReduce jobs on AWS-EMR from Eclipse

Submitted by 瘦欲@ on 2019-12-05 22:17:22
I have the WordCount MapReduce example in Eclipse. I exported it to a JAR, copied it to S3, and then ran it successfully on AWS-EMR. Then I read this article: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-common-programming-sample.html It shows how to use the AWS-EMR API to run MapReduce jobs, but it still assumes your MapReduce code is packaged in a JAR. I would like to know whether there is a way to run MapReduce code from Eclipse directly on AWS-EMR, without having to export it to a JAR. I haven't found a way to do this (for MapReduce jobs written in Java). I believe there is
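The API flow in the linked article submits the packaged JAR as a step; the rough Python equivalent with boto3 is sketched below (bucket names, main class, and cluster id are placeholders, not values from the question):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
        Steps=[{
            "Name": "WordCount from exported JAR",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/wordcount.jar",  # JAR exported from Eclipse
                "MainClass": "com.example.WordCount",   # placeholder main class
                "Args": ["s3://my-bucket/input", "s3://my-bucket/output"],
            },
        }],
    )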

How to set livy.server.session.timeout on EMR cluster bootstrap?

Submitted by 这一生的挚爱 on 2019-12-05 19:02:12
I am creating an EMR cluster and using a Jupyter notebook to run some Spark tasks. My tasks die after approximately one hour of execution, and the error is: An error was encountered: Invalid status code '400' from https://xxx.xx.x.xxx:18888/sessions/0/statements/20 with error payload: "requirement failed: Session isn't active." My understanding is that this is related to the Livy config livy.server.session.timeout, but I don't know how I can set it in the bootstrap of the cluster (I need to do it at bootstrap because the cluster is created with no SSH access). Thanks a lot in advance. On EMR,
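The Livy property can be supplied without SSH by attaching a livy-conf classification when the cluster is created; a short sketch (the five-hour value is an arbitrary example):

    # pass this list as Configurations=[...] to boto3's run_job_flow,
    # or via `aws emr create-cluster --configurations`
    livy_timeout_config = [{
        "Classification": "livy-conf",
        "Properties": {
            "livy.server.session.timeout": "5h",  # arbitrary example value
        },
    }]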

Simple RDD write to DynamoDB in Spark

Submitted by 。_饼干妹妹 on 2019-12-05 12:43:23
I just got stuck trying to import a basic RDD dataset into DynamoDB. This is the code:

import org.apache.hadoop.mapred.JobConf
var rdd = sc.parallelize(Array(("", Map("col1" -> Map("s" -> "abc"), "col2" -> Map("n" -> "123")))))
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.output.tableName", "table_x")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
rdd.saveAsHadoopDataset(jobConf)

And this is the error I get: 16/02/28 15:40:38 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 18, ip-172-31-9-224.eu-west-1.compute
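As an alternative to the DynamoDB Hadoop output format used above, small RDDs can also be written with plain boto3 from each partition. A PySpark sketch, assuming a pyspark shell where sc exists and that table_x in eu-west-1 has a key schema compatible with these items:

    import boto3

    def write_partition(rows):
        # one client per partition; table name and region mirror the question,
        # the item layout is an assumption
        table = boto3.resource("dynamodb", region_name="eu-west-1").Table("table_x")
        with table.batch_writer() as batch:
            for item in rows:
                batch.put_item(Item=item)

    # plain Python dicts instead of the Hadoop-connector attribute maps
    rdd = sc.parallelize([{"col1": "abc", "col2": 123}])
    rdd.foreachPartition(write_partition)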

Hadoop Non-splittable TextInputFormat

Submitted by 百般思念 on 2019-12-05 10:33:32
Question: Is there a way to have a whole file sent to a mapper without being split? I have read this, but I am wondering if there is another way of doing the same thing without having to generate an intermediate file. Ideally, I would like an existing Hadoop command-line option. I am using the streaming facility with Python scripts on Amazon EMR. Answer 1: Just set the configuration property mapred.min.split.size to something huge (10G): -D mapred.min.split.size=10737418240 Or compress the input
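On EMR the same -D flag can be passed to a streaming step; a hedged boto3 sketch (paths, mapper script, and cluster id are placeholders; the generic -D and -files options must come before the streaming options):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
        Steps=[{
            "Name": "streaming step with non-splittable input",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-D", "mapred.min.split.size=10737418240",  # 10G, from the answer above
                    "-files", "s3://my-bucket/mapper.py",       # placeholder mapper script
                    "-mapper", "mapper.py",
                    "-reducer", "aggregate",
                    "-input", "s3://my-bucket/input",
                    "-output", "s3://my-bucket/output",
                ],
            },
        }],
    )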

s3fs on Amazon EMR: Will it scale for approx 100million small files?

Submitted by 荒凉一梦 on 2019-12-05 10:11:43
Please refer to the following questions already asked: "Write 100 million files to s3" and "Too many open files in EMR". The size of the data being handled here is at least around 4-5 TB; to be precise, 300 GB with gzip compression. The input size will grow gradually, since this step aggregates the data over time. For example, the logs up to December 2012 will contain:

UDID-1, DateTime, Lat, Lng, Location
UDID-2, DateTime, Lat, Lng, Location
UDID-3, DateTime, Lat, Lng, Location
UDID-1, DateTime, Lat, Lng, Location

For this we would have to generate separate files with UDID (unique device identifier) as
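For context only, an alternative to mounting the bucket with s3fs is to write a per-UDID layout directly through EMRFS with a partitioned output. A rough PySpark sketch on a current EMR release; the paths and schema are assumptions based on the sample lines above, and with this many distinct UDIDs the result would still be a very large number of small objects:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("per-udid-output").getOrCreate()

    # assumed input location and column layout matching the sample lines
    logs = spark.read.csv(
        "s3://my-log-bucket/2012/*.gz",
        schema="udid STRING, datetime STRING, lat DOUBLE, lng DOUBLE, location STRING",
    )

    # one output directory per device id, written directly to S3 via EMRFS
    logs.write.partitionBy("udid").mode("append").csv("s3://my-output-bucket/by-udid/")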

How to install custom packages via an Amazon EMR bootstrap action in code?

Submitted by 做~自己de王妃 on 2019-12-05 08:33:55
I need to install some packages and binaries via an Amazon EMR bootstrap action, but I can't find any example that uses this. Basically, I want to install a Python package and have each Hadoop node use this package for processing the items in an S3 bucket. Here's a sample from boto:

name='Image to grayscale using SimpleCV python package',
mapper='s3n://elasticmapreduce/samples/imageGrayScale.py',
reducer='aggregate',
input='s3n://elasticmapreduce/samples/input',
output='s3n://<my output bucket>/output'

I need to make it use the SimpleCV Python package, but I'm not sure where to specify this. What
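A bootstrap action is just a script in S3 that every node runs before Hadoop starts. A hedged sketch using boto3 rather than the legacy boto shown above; the script path, release label, and instance settings are placeholders, and the script itself would contain something like sudo pip install SimpleCV:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.run_job_flow(
        Name="Image to grayscale using SimpleCV python package",
        ReleaseLabel="emr-5.11.1",  # placeholder release label
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        BootstrapActions=[{
            "Name": "install SimpleCV on every node",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install-simplecv.sh",  # placeholder script
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )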

How does MapReduce read from multiple input files?

Submitted by 旧时模样 on 2019-12-05 06:47:41
Question: I am developing code to read data and write it into HDFS using MapReduce. However, when I have multiple files, I don't understand how they are processed. The input path to the mapper is the name of the directory, as evident from the output of String filename = conf1.get("map.input.file"); So how does it process the files in the directory? Answer 1: In order to get the input file path you can use the context object, like this: FileSplit fileSplit = (FileSplit) context.getInputSplit(); String
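For anyone using Python streaming instead of the Java API, the analog of that FileSplit lookup is to read the job configuration values that Hadoop streaming exports as environment variables (dots become underscores). A small sketch, assuming a Hadoop version that exposes mapreduce_map_input_file or the older map_input_file name:

    #!/usr/bin/env python
    import os
    import sys

    # current input file for this mapper task, as exported by Hadoop streaming
    input_file = os.environ.get(
        "mapreduce_map_input_file",
        os.environ.get("map_input_file", "unknown"),
    )

    for line in sys.stdin:
        # emit the source file alongside each record (tab-separated)
        sys.stdout.write("%s\t%s" % (input_file, line))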