amazon-emr

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that I am using a JSON object that looks like this:

    [
      {
        "Classification": "zeppelin-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
              "ZEPPELIN_NOTEBOOK_S3_BUCKET": "hs-zeppelin-notebooks",
              "ZEPPELIN_NOTEBOOK_USER": "user"
            },
            "Configurations": []
          }
        ]
      }
    ]

I am pasting this object into the Software configuration page of EMR. My question is: how/where can I configure the Spark interpreter
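One direction this usually goes, since the configuration is applied at cluster creation, is to add further classifications to the same array. A minimal sketch, assuming the goal is to tune Spark itself rather than Zeppelin's own interpreter settings: the spark-defaults classification below sits next to the zeppelin-env one (collapsed here), and the two property values are placeholders, not settings taken from the question.

    [
      { "Classification": "zeppelin-env", "Properties": {}, "Configurations": [] },
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.executor.memory": "4g",
          "spark.executor.cores": "2"
        }
      }
    ]

Zeppelin's Spark interpreter on EMR picks up these cluster-wide Spark defaults; properties that exist only in Zeppelin's interpreter configuration would still have to be changed in the Zeppelin UI or its configuration files after startup.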

Running steps of EMR in parallel

I am running a Spark job on an EMR cluster. The issue I am facing is that all the jobs submitted to EMR execute as steps, one after another in a queue. Is there any way to make them run in parallel, and if not, is there an alternative? Elastic MapReduce comes by default with a YARN setup that is very "step" oriented: a single CapacityScheduler queue with 100% of the cluster resources assigned to it. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes the cluster usage for that single job, granting all available resources to it until it finishes. Running multiple concurrent jobs in an
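A hedged sketch of what changing that default can look like: EMR exposes the CapacityScheduler through the capacity-scheduler configuration classification, so at cluster creation the single root queue can be split into several queues and jobs can be submitted to different queues. The queue names and capacity percentages below are placeholders, not values from the question.

    [
      {
        "Classification": "capacity-scheduler",
        "Properties": {
          "yarn.scheduler.capacity.root.queues": "default,parallel",
          "yarn.scheduler.capacity.root.default.capacity": "50",
          "yarn.scheduler.capacity.root.parallel.capacity": "50"
        }
      }
    ]

Jobs submitted directly to YARN (for example spark-submit --queue parallel over SSH) can then run alongside work in the default queue; EMR steps themselves are still started one after another by the step runner.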

How do I make matplotlib work in AWS EMR Jupyter notebook?

This is very close to this question, but I have added a few details specific to my case: Matplotlib Plotting using AWS-EMR jupyter notebook. I would like to find a way to use matplotlib inside my Jupyter notebook. Here is the code snippet in error; it's fairly simple:

    import matplotlib
    matplotlib.use("agg")
    import matplotlib.pyplot as plt
    plt.plot([1,2,3,4])
    plt.show()

I chose this snippet because this line alone fails, as it tries to use Tkinter (which is not installed on an AWS EMR cluster):

    import matplotlib.pyplot as plt

When I run the full notebook snippet, the result is no
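A minimal sketch of the usual workaround when no display is attached, assuming the code runs in a kernel on the cluster: keep a headless backend and write the figure to a file instead of calling plt.show(). The output path is an arbitrary choice.

    import matplotlib
    matplotlib.use("Agg")          # headless backend, no Tkinter/X display required
    import matplotlib.pyplot as plt

    plt.plot([1, 2, 3, 4])
    plt.savefig("/tmp/plot.png")   # saved to disk; download or embed it separately

In a notebook backed by a plain Python kernel, %matplotlib inline renders figures directly in the output cell; with a Spark (Sparkmagic/Livy) kernel the plotting code has to execute wherever matplotlib is actually installed.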

Hadoop Non-splittable TextInputFormat

Is there a way to have a whole file sent to a mapper without being split? I have read this, but I am wondering if there is another way of doing the same thing without having to generate an intermediate file. Ideally, I would like an existing Hadoop command-line option. I am using the streaming facility with Python scripts on Amazon EMR. Just set the configuration property mapred.min.split.size to something huge (10G):

    -D mapred.min.split.size=10737418240

Or compress the input file using a codec that isn't splittable (Gzip). With the .gz extension, TextInputFormat will return false to
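For context, a sketch of where that flag sits in a full streaming invocation; the jar path, bucket, and script names are placeholders (the streaming jar location differs between EMR AMI versions), and the generic -D option has to come before the streaming-specific options.

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapred.min.split.size=10737418240 \
        -input s3://my-bucket/input/ \
        -output s3://my-bucket/output/ \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py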

How does MapReduce read from multiple input files?

I am developing code to read data and write it into HDFS using MapReduce. However, when I have multiple input files I don't understand how they are processed. The input path seen by the mapper is the name of the directory, as is evident from the output of

    String filename = conf1.get("map.input.file");

So how does it process the files in the directory? To get the input file path you can use the context object, like this:

    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String inputFilePath = fileSplit.getPath().toString();

And as for how multiple files are processed: Several instances
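A small sketch of where that snippet lives, assuming the new (org.apache.hadoop.mapreduce) API; the class name and the decision to emit the file path as the output key are purely illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each mapper instance works on one input split, and the split
            // records which of the directory's files it was carved from.
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            String inputFilePath = fileSplit.getPath().toString();
            context.write(new Text(inputFilePath), value);
        }
    }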

How to handle fields enclosed within quotes (CSV) when importing data from S3 into DynamoDB using EMR/Hive

I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed within double quotes and separated by commas. While creating the external table in Hive, I am able to specify the delimiter as a comma, but how do I specify that fields are enclosed within quotes? If I don't specify it, I see that the values in DynamoDB are populated within two double quotes ""value"", which seems to be wrong. I am using the following command to create the external table. Is there a way to specify
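The usual way to handle quoted CSV fields in Hive is the OpenCSVSerde instead of ROW FORMAT DELIMITED. A sketch of the staging table, with the table name, columns, and bucket path made up for illustration; note that this SerDe treats every column as STRING, so casts may be needed before writing to DynamoDB.

    CREATE EXTERNAL TABLE s3_staging (
      id     STRING,
      name   STRING,
      amount STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
      "separatorChar" = ",",
      "quoteChar"     = "\""
    )
    LOCATION 's3://my-bucket/path/to/csv/';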

Amazon Elastic MapReduce Bootstrap Actions not working

I have tried the following combinations of bootstrap actions to increase the heap size of my job, but none of them seem to work:

    --mapred-key-value mapred.child.java.opts=-Xmx1024m
    --mapred-key-value mapred.child.ulimit=unlimited
    --mapred-key-value mapred.map.child.java.opts=-Xmx1024m
    --mapred-key-value mapred.map.child.ulimit=unlimited
    -m mapred.map.child.java.opts=-Xmx1024m
    -m mapred.map.child.ulimit=unlimited
    -m mapred.child.java.opts=-Xmx1024m
    -m mapred.child.ulimit=unlimited

What is the right syntax? You have two options to achieve this: Custom JVM Settings. In order to apply custom
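Side note, hedged: the bootstrap-action syntax above belongs to the older AMI-based clusters; on release-label EMR (4.x and later) the same heap settings are normally applied through a configuration classification instead. The 1024m value mirrors the question, everything else is boilerplate.

    [
      {
        "Classification": "mapred-site",
        "Properties": {
          "mapreduce.map.java.opts": "-Xmx1024m",
          "mapreduce.reduce.java.opts": "-Xmx1024m"
        }
      }
    ]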

Running EMR Spark With Multiple S3 Accounts

I have an EMR Spark job that needs to read data from S3 in one account and write to another. I split my job into two steps:

1. Read the data from S3 (no credentials required because my EMR cluster is in the same account).
2. Read the data in the local HDFS created by step 1 and write it to an S3 bucket in another account.

I've attempted setting the hadoopConfiguration:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<your access key>")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","<your
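A hedged sketch of one variant often suggested for this situation, assuming the s3a connector is available on the cluster (Hadoop 2.8 or later): per-bucket credentials, so only paths under the second account's bucket use the other keys and the reads in the first account stay untouched. The bucket name and placeholders are illustrative.

    // Credentials scoped to a single bucket rather than the whole filesystem.
    sc.hadoopConfiguration.set("fs.s3a.bucket.other-account-bucket.access.key", "<access key for account B>")
    sc.hadoopConfiguration.set("fs.s3a.bucket.other-account-bucket.secret.key", "<secret key for account B>")

    // Step 2: read the intermediate data from HDFS and write it cross-account.
    val data = sc.textFile("hdfs:///intermediate/output")
    data.saveAsTextFile("s3a://other-account-bucket/output/")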

Spark SQL fails because “Constant pool has grown past JVM limit of 0xFFFF”

I am running this code on EMR 4.6.0 + Spark 1.6.1:

    val sqlContext = SQLContext.getOrCreate(sc)
    val inputRDD = sqlContext.read.json(input)
    try {
      inputRDD.filter("`first_field` is not null OR `second_field` is not null").toJSON.coalesce(10).saveAsTextFile(output)
      logger.info("DONE!")
    } catch {
      case e : Throwable => logger.error("ERROR" + e.getMessage)
    }

In the last stage of saveAsTextFile, it fails with this error:

    16/07/15 08:27:45 ERROR codegen.GenerateUnsafeProjection: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0xFFFF
    /* 001 */ /
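A sketch of one mitigation sometimes suggested for this class of error, not taken from the original answer: the code generated for the UnsafeProjection grows with the number of columns, so projecting down to only the fields actually needed before toJSON can keep the generated class under the constant-pool limit. Whether this helps, and whether dropping the other fields is acceptable, depends entirely on how wide the JSON schema is.

    val slimmed = inputRDD
      .filter("`first_field` is not null OR `second_field` is not null")
      .select("first_field", "second_field")   // only the columns required downstream
    slimmed.toJSON.coalesce(10).saveAsTextFile(output)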

file path in hdfs

I want to read a file from the Hadoop file system. To build the correct path to the file, I need the host name and port of HDFS, so my path to the file will look something like:

    Path path = new Path("hdfs://123.23.12.4344:9000/user/filename.txt")

Now I want to know how to extract the host name ("123.23.12.4344") and the port (9000). Basically, I want to access the FileSystem on Amazon EMR, but when I use

    FileSystem fs = FileSystem.get(getConf());

I get: You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system
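A small sketch of what that message points at, assuming the Hadoop client configuration (core-site.xml) is on the classpath; the URI below reuses the example from the question, and the class name is made up for illustration.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPathExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Option 1: let the cluster-side configuration supply host and port,
            // so nothing is hard-coded in the application.
            String defaultFs = conf.get("fs.defaultFS");
            System.out.println("default filesystem: " + defaultFs);
            FileSystem fs = FileSystem.get(conf);

            // Option 2: pass the URI explicitly, as the error message suggests.
            FileSystem explicitFs =
                FileSystem.get(new URI("hdfs://123.23.12.4344:9000"), conf);
            System.out.println(explicitFs.exists(new Path("/user/filename.txt")));
        }
    }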