amazon-emr

Run HiveFromSpark example with MASTER=yarn-cluster

家住魔仙堡 submitted on 2019-12-08 02:14:14
Question: I'm trying to run the HiveFromSpark example on my EMR Spark/Hive cluster. The problem: using yarn-client,

~/spark/bin/spark-submit --master yarn-client --num-executors=19 --class org.apache.spark.examples.sql.hive.HiveFromSpark ~/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar

works like a charm. But using yarn-cluster,

~/spark/bin/spark-submit --master yarn-cluster --num-executors=19 --class org.apache.spark.examples.sql.hive.HiveFromSpark ~/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar

fails…
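
The operative difference between the two modes is where the driver runs: with yarn-cluster it runs inside the ApplicationMaster on a worker node, so anything the example's driver code needs (for HiveFromSpark, the Hive metastore configuration) must be available there, not just on the machine you submit from. What the example exercises amounts to roughly this PySpark sketch (Spark 1.x API; the shipped original is Scala):

# Minimal PySpark sketch of the Hive access HiveFromSpark exercises.
# In yarn-cluster mode this driver code runs on a cluster node, so the
# Hive configuration must be resolvable there for the metastore calls.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveFromSparkSketch")
hc = HiveContext(sc)

hc.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
for row in hc.sql("SELECT key, value FROM src").collect():
    print(row)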

Running MapReduce jobs on AWS-EMR from Eclipse

a 夏天 submitted on 2019-12-07 13:30:24
Question: I have the WordCount MapReduce example in Eclipse. I exported it to a jar, copied it to S3, and then ran it on AWS EMR, successfully. Then I read this article: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-common-programming-sample.html. It shows how to use the AWS EMR API to run MapReduce jobs, but it still assumes your MapReduce code is packaged in a jar. I would like to know if there is a way to run MapReduce code from Eclipse directly on AWS EMR, without having to export…
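
For reference, the jar-based submission the linked article describes looks roughly like this with boto3 (a sketch, not the article's own code; the bucket, jar path, roles, and instance settings are placeholders):

# Hedged sketch: launch an EMR cluster that runs a pre-built jar from S3.
# All names (bucket, jar path, roles, instance types) are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="wordcount-from-jar",
    ReleaseLabel="emr-5.0.0",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://mybucket/wordcount.jar",
            "Args": ["s3://mybucket/input", "s3://mybucket/output"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

Note that this still submits a jar; it only removes the manual upload-and-click steps, which is why the question of skipping the export entirely remains open.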

Why is my Spark App running in only 1 executor?

痴心易碎 submitted on 2019-12-07 08:21:59
Question: I'm still fairly new to Spark, but I have been able to create the Spark app I need. It reprocesses data from our SQL Server via JDBC drivers (we are removing expensive stored procedures): the app loads a few tables from SQL Server into dataframes via JDBC, then I do a few joins, a group, and a filter, and finally reinsert the results into a different table via JDBC. All of this executes just fine on Spark EMR in Amazon Web Services on an m3.xlarge with 2 cores, in around a minute. My…
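
One frequent cause of a single-executor JDBC job (offered here as a guess, not as this asker's confirmed diagnosis) is that an unpartitioned JDBC read yields a single partition, so every downstream stage runs as one task. A PySpark sketch of a partitioned read, with hypothetical connection details:

# Hedged sketch: partition the JDBC read so Spark can parallelize it.
# The url, table, credentials, and split column/bounds are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-reprocess").getOrCreate()

df = spark.read.jdbc(
    url="jdbc:sqlserver://myhost:1433;databaseName=mydb",
    table="dbo.Orders",
    column="OrderId",      # numeric column whose range is split
    lowerBound=1,
    upperBound=10000000,
    numPartitions=16,      # yields 16 parallel read tasks
    properties={"user": "myuser", "password": "mypassword"},
)
print(df.rdd.getNumPartitions())  # should report 16, not 1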

how to install custom packages on amazon EMR bootstrap action in code?

我只是一个虾纸丫 submitted on 2019-12-07 05:19:34
Question: I need to install some packages and binaries during the Amazon EMR bootstrap action, but I can't find any example that does this. Basically, I want to install a Python package and have each Hadoop node use this package for processing items in an S3 bucket. Here's a sample from boto:

name='Image to grayscale using SimpleCV python package',
mapper='s3n://elasticmapreduce/samples/imageGrayScale.py',
reducer='aggregate',
input='s3n://elasticmapreduce/samples/input',
output='s3n://<my output bucket…
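
A sketch of how a bootstrap action attaches to that kind of job flow, using classic boto to match the sample above (the bucket names and script path are placeholders; bootstrap.sh would contain the install commands, e.g. a pip install of SimpleCV, and runs on every node before Hadoop starts):

# Hedged sketch (classic boto): run an install script from S3 on every
# node via a bootstrap action, then run the streaming step from the
# sample. All bucket names and the script path are placeholders.
import boto.emr
from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

install = BootstrapAction(
    name="install python packages",
    path="s3://mybucket/bootstrap.sh",
    bootstrap_action_args=[],
)
step = StreamingStep(
    name="Image to grayscale using SimpleCV python package",
    mapper="s3n://elasticmapreduce/samples/imageGrayScale.py",
    reducer="aggregate",
    input="s3n://elasticmapreduce/samples/input",
    output="s3n://mybucket/output",
)
conn.run_jobflow(
    name="grayscale job",
    bootstrap_actions=[install],
    steps=[step],
    num_instances=3,
)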

s3fs on Amazon EMR: Will it scale for approx 100million small files?

吃可爱长大的小学妹 submitted on 2019-12-07 04:27:42
Question: Please refer to the following questions already asked: "Write 100 million files to s3" and "Too many open files in EMR". The size of the data being handled here is at least around 4-5 TB; to be precise, 300 GB with gzip compression. The input will grow gradually, as this step aggregates data over time. For example, the logs up to December 2012 will contain:

UDID-1, DateTime, Lat, Lng, Location
UDID-2, DateTime, Lat, Lng, Location
UDID-3, DateTime, Lat, Lng, Location
UDID-1, DateTime, Lat, Lng,…
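
Writing one S3 object per UDID at this scale is the part that tends not to hold up. A common workaround (a sketch of the idea, not something this question confirms) is to cap the file count by hashing each UDID into a fixed number of bucket files, so records for a device always land in the same, much larger file:

# Hedged sketch: instead of one output file per UDID (~100M objects),
# hash each UDID into one of N bucket files. N is a tuning assumption.
import hashlib

N_BUCKETS = 8192

def bucket_for(udid):
    # Stable hash so the same UDID always maps to the same bucket file.
    digest = hashlib.md5(udid.encode("utf-8")).hexdigest()
    return "bucket-%05d" % (int(digest, 16) % N_BUCKETS)

print(bucket_for("UDID-1"))  # stable bucket name for this device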

Alternatives for Athena to query the data on S3

馋奶兔 submitted on 2019-12-06 13:57:25
Question: I have around 300 GB of data on S3. Let's say the data looks like this:

## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv

We are doing some complex aggregation on the data, and because some countries' data is big and some countries' data is small, AWS EMR doesn't make sense to use: once the small countries are finished, the…
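
Since the title names Athena, this is roughly what querying that layout from Python looks like (a sketch only; the database, table, and results bucket are placeholders, and an external table partitioned by Country/Month/Day would need to be defined over the CSVs first):

# Hedged sketch: run an Athena query over the partitioned CSVs via boto3.
# Database, table, and the results bucket are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")
resp = athena.start_query_execution(
    QueryString="""
        SELECT country, count(*) AS rows
        FROM logs
        WHERE month = '06'
        GROUP BY country
    """,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])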

how to run a mapreduce job on amazon's elastic mapreduce (emr) cluster from windows?

谁都会走 submitted on 2019-12-06 13:53:47
Question: I'm trying to learn how to run a Java Map/Reduce (M/R) job on Amazon's EMR. The documentation I am following is here: http://aws.amazon.com/articles/3938. I am on a Windows 7 computer. When I try to run this command, I am shown the help information:

./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json

Of course, since I am on a Windows machine, I actually type in this command. I am not sure why, but for this particular command there was not a Windows version (all commands were…
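
As an aside, the job can also be submitted without the Ruby CLI at all, which sidesteps the Windows problem. A boto3 sketch that adds a streaming step to an already-running cluster (the cluster id and S3 paths are placeholders, and command-runner.jar assumes a modern EMR release label rather than the old AMI-based clusters this article targets):

# Hedged sketch: add a streaming step to an existing cluster from any OS,
# bypassing the Ruby CLI. Cluster id and S3 paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "streaming wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://mybucket/wordSplitter.py",
                "-mapper", "wordSplitter.py",
                "-reducer", "aggregate",
                "-input", "s3://mybucket/input",
                "-output", "s3://mybucket/output",
            ],
        },
    }],
)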

How to get filename when running mapreduce job on EC2?

ぃ、小莉子 submitted on 2019-12-06 13:47:01
Question: I am learning Elastic MapReduce and started off with the Word Splitter example provided in the Amazon tutorial section (code shown below). The example produces a word count over all the words in all the input documents provided, but I want word counts broken out by file name, i.e. the count of a word in just one particular document. Since the Python word-count code takes its input from stdin, how do I tell which input line came from which document? Thanks.

#!/usr/bin/python
import sys…
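
Hadoop streaming exposes the current split's file name to the mapper through the environment: map_input_file on older Hadoop, mapreduce_map_input_file on newer. A hedged sketch of a per-file word-count mapper built on that (keyed as "filename|word" so the aggregate reducer counts per document):

#!/usr/bin/python
# Hedged sketch: emit "filename|word" keys so the aggregate reducer
# counts words per input document. Relies on Hadoop streaming exporting
# the current split's path as map_input_file / mapreduce_map_input_file.
import os
import sys

filename = os.environ.get("mapreduce_map_input_file",
                          os.environ.get("map_input_file", "unknown"))

for line in sys.stdin:
    for word in line.split():
        print("LongValueSum:%s|%s\t1" % (filename, word))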

list S3 folder on EMR

你。 submitted on 2019-12-06 09:32:35
I fail to understand how to simply list the contents of an S3 bucket on EMR during a Spark job. I wanted to do the following:

Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false));

This always fails with the following error:

java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020

In the hadoopConfiguration, fs.defaultFS -> hdfs://**********.eu-central-1.compute.internal:8020. The way I…
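
The error points at the cause: FileSystem.get(conf) hands back the filesystem for fs.defaultFS, which on EMR is HDFS, regardless of which FileSystem subclass the static call goes through. Hadoop's FileSystem.get(URI, Configuration) overload (or equivalently Path.getFileSystem) selects the filesystem matching the URI scheme instead. A minimal Java sketch of the corrected call, keeping the snippet's own names (bucket is a placeholder, and spark is the session from the snippet above):

// Hedged sketch: ask for the filesystem that matches the s3:// URI
// rather than the cluster default (HDFS). Bucket name is a placeholder.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = FileSystem.get(URI.create("s3://mybucket"), conf);
RemoteIterator<LocatedFileStatus> it = s3.listFiles(new Path("s3://mybucket"), false);
while (it.hasNext()) {
    System.out.println(it.next().getPath());
}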

GroupBy Operation of DataFrame takes lot of time in spark 2.0

会有一股神秘感。 submitted on 2019-12-06 08:17:06
Question: In one of my Spark jobs (2.0 on EMR 5.0.0) I had about 5 GB of data that was cross-joined with 30 rows (a few MBs of data), and I then needed to group by it. What I noticed was that this took a lot of time: approximately 4 hours with one m3.xlarge master and six m3.2xlarge core nodes. Of the total time, 2 hours went to processing and another 2 hours to writing the data to S3. The time taken was not very impressive to me. I searched the net and found this link, which says groupBy…
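
For the cross-join half of this, one standard lever (a guess at relevance, not this asker's confirmed fix) is to broadcast the 30-row side so the join happens map-side and the 5 GB side is never shuffled for it. A PySpark sketch with hypothetical dataframes; note crossJoin is the Spark 2.1+ spelling, and on 2.0 an unconditioned join with spark.sql.crossJoin.enabled=true plays the same role:

# Hedged sketch: broadcast the tiny side of the cross join so the large
# side avoids a join shuffle. Paths and the group key are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("crossjoin-groupby").getOrCreate()

df_big = spark.read.parquet("s3://mybucket/big")      # ~5 GB, placeholder
df_small = spark.read.parquet("s3://mybucket/small")  # ~30 rows, placeholder

joined = df_big.crossJoin(broadcast(df_small))
result = joined.groupBy("some_key").count()
result.write.parquet("s3://mybucket/out")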