amazon-emr

Automatic AWS DynamoDB to S3 export failing with “role/DataPipelineDefaultRole is invalid”

Submitted by 一曲冷凌霜 on 2019-12-05 02:12:26
Precisely following the step-by-step instructions on this page, I am trying to export the contents of one of my DynamoDB tables to an S3 bucket. I create a pipeline exactly as instructed, but it fails to run: it seems to have trouble identifying or launching an EC2 resource to do the export. When I access EMR through the AWS Console, I see entries like this: Cluster: df-0..._@EmrClusterForBackup_2015-03-06T00:33:04, Terminated with errors: EMR service role arn:aws:iam::...:role/DataPipelineDefaultRole is invalid. Why am I getting this message? Do I need to set up or configure something else for the pipeline to run?
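A quick way to check one common cause is to inspect the role's trust policy. The boto3 sketch below does that; the role name comes from the error above, and the exact set of services that must be allowed to assume the role is an assumption worth verifying against the current AWS documentation.

    import json
    import boto3

    iam = boto3.client("iam")

    # Fetch the trust policy of the role named in the error message.
    role = iam.get_role(RoleName="DataPipelineDefaultRole")
    trust = role["Role"]["AssumeRolePolicyDocument"]
    print(json.dumps(trust, indent=2))

    # Collect the services allowed to assume the role. For an EMR cluster launched
    # by Data Pipeline, both services below typically need to be trusted
    # (assumption -- verify against current AWS documentation).
    trusted = set()
    for stmt in trust.get("Statement", []):
        svc = stmt.get("Principal", {}).get("Service", [])
        trusted.update([svc] if isinstance(svc, str) else svc)

    for needed in ("elasticmapreduce.amazonaws.com", "datapipeline.amazonaws.com"):
        print(needed, "-> trusted" if needed in trusted else "-> MISSING")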

Alternatives for Athena to query the data on S3

Submitted by 房东的猫 on 2019-12-04 20:34:07
I have around 300 GB of data on S3. Let's say the data looks like this:

    ## S3://Bucket/Country/Month/Day/1.csv
    S3://Countries/Germany/06/01/1.csv
    S3://Countries/Germany/06/01/2.csv
    S3://Countries/Germany/06/01/3.csv
    S3://Countries/Germany/06/02/1.csv
    S3://Countries/Germany/06/02/2.csv

We are doing some complex aggregation on the data, and because some countries' data is big while other countries' data is small, AWS EMR doesn't make sense to use: once the small countries are finished, their resources are wasted while the big countries keep running for a long time. Therefore, we decided to use Athena.
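For the Athena direction, one hedged boto3 sketch is to register the S3 layout above as a partitioned table, so per-country queries scan only the relevant prefixes. The database, table, and column names here are hypothetical.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Hypothetical database, table, and column names; the partition keys mirror the
    # Country/Month/Day prefixes shown above.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS logs.countries (
        col1 string,
        col2 string
    )
    PARTITIONED BY (country string, month string, day string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://Countries/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "logs"},                        # hypothetical
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"}, # hypothetical
    )

    # Because the prefixes are not Hive-style (country=Germany/...), each partition
    # has to be registered explicitly, e.g.:
    #   ALTER TABLE logs.countries ADD PARTITION (country='Germany', month='06', day='01')
    #   LOCATION 's3://Countries/Germany/06/01/'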

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

Submitted by 。_饼干妹妹 on 2019-12-04 19:28:22
According to the AWS Athena limits, you can submit up to 20 queries of the same type at a time, but this is a soft limit that can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes about 2 minutes to finish. I am the only one using the Athena service in this AWS account. However, when I look at the state of the queries through the console, I see that only a few of them (5 on average) are actually being executed, despite all of them being in the Running state. Here is what I would normally see in the Athena history tab. I understand that, after I …
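A minimal boto3 sketch of this workflow submits the queries and then polls their real states, which helps distinguish queries Athena is actually executing from ones it has merely accepted. The query list, database, and output location below are placeholders.

    import time
    import boto3

    athena = boto3.client("athena")

    # Placeholder CTAS statements; database and output location are hypothetical.
    queries = ["CREATE TABLE ctas_{} AS SELECT ...".format(i) for i in range(16)]

    ids = []
    for q in queries:
        resp = athena.start_query_execution(
            QueryString=q,
            QueryExecutionContext={"Database": "mydb"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
        )
        ids.append(resp["QueryExecutionId"])

    # Poll until every query reaches a terminal state, printing how many are in
    # each state (QUEUED / RUNNING / SUCCEEDED / FAILED / CANCELLED) on each pass.
    while ids:
        states = {}
        for qid in list(ids):
            status = athena.get_query_execution(QueryExecutionId=qid)
            state = status["QueryExecution"]["Status"]["State"]
            states[state] = states.get(state, 0) + 1
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                ids.remove(qid)
        print(states)
        time.sleep(30)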

How to get filename when running mapreduce job on EC2?

Submitted by 拥有回忆 on 2019-12-04 19:23:45
I am learning Elastic MapReduce and started off with the Word Splitter example provided in the Amazon tutorial section (code shown below). The example produces word counts for all the words across all the input documents provided, but I want word counts by file name, i.e. the count of a word in just one particular document. Since the Python code for the word count takes its input from stdin, how do I tell which input line came from which document? Thanks.

    #!/usr/bin/python
    import sys
    import re

    def main(argv):
        line = sys.stdin.readline()
        pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
        try:
            …
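Hadoop Streaming exports the job configuration to the mapper process as environment variables (dots replaced by underscores), so the current split's file name is usually available as mapreduce_map_input_file (map_input_file on older Hadoop versions). A sketch of a mapper that uses this, not the tutorial's original code, might look like:

    #!/usr/bin/python
    import os
    import re
    import sys

    def main():
        # Hadoop Streaming exposes the input split's file path via the environment;
        # the variable name depends on the Hadoop version.
        filename = os.environ.get("mapreduce_map_input_file") \
            or os.environ.get("map_input_file", "unknown")
        pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
        for line in sys.stdin:
            for word in pattern.findall(line):
                # Emit (filename, word) as the key so the reducer counts per document.
                print("%s\t%s\t%d" % (filename, word.lower(), 1))

    if __name__ == "__main__":
        main()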

How to run a MapReduce job on Amazon's Elastic MapReduce (EMR) cluster from Windows?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-04 18:15:15
I'm trying to learn how to run a Java Map/Reduce (M/R) job on Amazon's EMR. The documentation I am following is here: http://aws.amazon.com/articles/3938. I am on a Windows 7 computer. When I try to run this command, I am shown the help information: ./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json. Of course, since I am on a Windows machine, I actually type in this command (I am not sure why, but for this particular command there was no Windows version; all the other commands were shown in pairs, one for *nix and one for Windows): ruby elastic-mapreduce RunJobFlow my_job.json. My …
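One alternative that behaves the same on Windows and *nix is to submit the job flow programmatically with boto3 instead of the old Ruby CLI. This is a hedged sketch, not the article's method; all names, paths, roles, and instance settings below are hypothetical.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="my-mr-job",
        ReleaseLabel="emr-5.30.0",
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "run-my-jar",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/my-job.jar",   # hypothetical JAR location
                "Args": ["s3://my-bucket/input/", "s3://my-bucket/output/"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])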

AWS EMR perform “bootstrap” script on all the already running machines in cluster

Submitted by 你离开我真会死。 on 2019-12-04 17:15:33
Question: I have an EMR cluster that runs 24/7. I can't turn it off and launch a new one. What I would like to do is perform something like a bootstrap action on the already running cluster, preferably using Python and boto or the AWS CLI. I can imagine doing this in two steps: 1) run the script on all the running instances (it would be nice if that were somehow possible, for example, from boto); 2) add the script to the bootstrap actions for the case where I later resize the cluster. So my …
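For step 1, a hedged sketch is to list the cluster's currently running instances with boto3 and then run the script on each one over SSH. The cluster id, key file, user name, and script path are all assumptions, and it presumes SSH access to the nodes.

    import os
    import subprocess
    import boto3

    emr = boto3.client("emr")

    CLUSTER_ID = "j-XXXXXXXXXXXX"                            # hypothetical
    KEY_FILE = os.path.expanduser("~/.ssh/my-emr-key.pem")   # hypothetical
    SCRIPT = "setup.sh"                                      # hypothetical local script

    # Only the instances that are currently RUNNING (pagination omitted for brevity).
    instances = emr.list_instances(
        ClusterId=CLUSTER_ID,
        InstanceStates=["RUNNING"],
    )["Instances"]

    for inst in instances:
        ip = inst["PrivateIpAddress"]
        # Copy the script to every node and execute it there.
        subprocess.check_call(["scp", "-i", KEY_FILE, SCRIPT, "hadoop@%s:/tmp/" % ip])
        subprocess.check_call(["ssh", "-i", KEY_FILE, "hadoop@%s" % ip,
                               "bash /tmp/%s" % SCRIPT])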

GroupBy operation on a DataFrame takes a lot of time in Spark 2.0

Submitted by 随声附和 on 2019-12-04 14:37:01
In one of my Spark jobs (Spark 2.0 on EMR 5.0.0) I had about 5 GB of data that was cross joined with 30 rows (a few MBs of data), and I further needed to group by it. What I noticed is that it was taking a lot of time: approximately 4 hours with one m3.xlarge master and six m3.2xlarge core nodes. Of the total time, 2 hours were spent on processing and another 2 hours on writing the data to S3. The time taken was not very impressive to me. I tried searching the net and found this link, which says that groupBy leads to a lot of shuffling and suggests that ReduceByKey should be used to avoid it.
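Since this is a DataFrame groupBy (not an RDD groupByKey), Spark already performs partial aggregation before the shuffle, so the reduceByKey advice does not translate directly. A technique that may matter more here is broadcasting the 30-row side of the cross join so the 5 GB side is not shuffled for the join. A minimal PySpark sketch, with hypothetical paths and column names:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.functions import broadcast

    spark = (SparkSession.builder
             .appName("crossjoin-groupby-sketch")
             # Needed on Spark 2.0 for a join without a condition (cartesian product).
             .config("spark.sql.crossJoin.enabled", "true")
             .getOrCreate())

    big = spark.read.parquet("s3://my-bucket/big-input/")      # ~5 GB, hypothetical path
    small = spark.read.parquet("s3://my-bucket/small-input/")  # ~30 rows, hypothetical path

    # Broadcasting the tiny side keeps the 5 GB side from being shuffled for the join.
    joined = big.join(broadcast(small))

    # DataFrame groupBy/agg does partial (map-side) aggregation before the shuffle.
    result = (joined
              .groupBy("key_col")                              # hypothetical column
              .agg(F.sum("value_col").alias("total")))         # hypothetical column

    result.write.mode("overwrite").parquet("s3://my-bucket/output/")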

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-04 11:16:23
I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. I have followed the steps in the official AWS documentation (reference link below), but I am facing some discrepancies with regard to accessing the Glue Catalog DB/Tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been provided. AWS documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html Observations: Using spark-shell (from the EMR master node): works, able to access the Glue DB …
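One thing worth checking is that the Glue client factory setting from the linked documentation actually reaches the SparkSession in every launch mode (spark-shell vs spark-submit). A minimal PySpark sketch that sets it explicitly; the database and table names are hypothetical.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("glue-catalog-sketch")
             .config("hive.metastore.client.factory.class",
                     "com.amazonaws.glue.catalog.metastore."
                     "AWSGlueDataCatalogHiveClientFactory")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SHOW DATABASES").show()                       # should list Glue databases
    spark.sql("SELECT * FROM mydb.mytable LIMIT 10").show()  # hypothetical db/table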

Amazon MapReduce best practices for logs analysis

Submitted by 久未见 on 2019-12-04 11:02:55
Question: I'm parsing access logs generated by Apache, Nginx, and Darwin (a video streaming server), and aggregating statistics for each delivered file by date / referrer / user agent. Tons of logs are generated every hour, and that number is likely to increase dramatically in the near future, so processing that kind of data in a distributed manner via Amazon Elastic MapReduce sounds reasonable. Right now I have my mappers and reducers ready to process the data, and I have tested the whole process with the following flow: …
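The excerpt cuts off before the flow itself. Purely as an illustration of the kind of mapper described above (not the author's code), a streaming mapper that keys each delivered file by date / referrer / user agent might look like this; the combined-log-style regex is an assumption, since real Apache, Nginx, and Darwin formats differ.

    #!/usr/bin/python
    import re
    import sys

    # Combined-log-style pattern (an assumption about the log format).
    LOG_RE = re.compile(
        r'\S+ \S+ \S+ \[(?P<date>[^:]+):[^\]]+\] "\S+ (?P<path>\S+) \S+" \d+ \d+ '
        r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"')

    for line in sys.stdin:
        m = LOG_RE.match(line)
        if not m:
            continue
        # Key: delivered file plus date / referrer / user agent; value: 1 to be summed.
        key = "\t".join([m.group("path"), m.group("date"),
                         m.group("referrer"), m.group("useragent")])
        print("%s\t%d" % (key, 1))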

AWS EMR performance HDFS vs S3

Submitted by 六月ゝ 毕业季﹏ on 2019-12-04 08:54:11
In Big Data, the code is pushed towards the data for execution. This makes sense, since the data is huge and the code for execution is relatively small. On AWS EMR, the data can be either in HDFS or in S3. In the case of S3, the data has to be pulled to the core/task nodes for execution from other nodes, which can be a bit of overhead compared to data in HDFS. Recently, I noticed that while an MR job was executing there was huge latency in getting the log files into S3: sometimes it took a couple of minutes for the log files to appear even after the job had completed. Any …
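On the data-locality point, one common pattern when S3 read overhead matters is to copy the input into HDFS with s3-dist-cp as a first step and run the job against HDFS. A hedged boto3 sketch; the cluster id and paths are hypothetical.

    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",                      # hypothetical cluster id
        Steps=[{
            "Name": "copy-input-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "s3://my-bucket/input/",   # hypothetical
                         "--dest", "hdfs:///input/"],
            },
        }],
    )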