amazon-emr

How to install a GUI on Amazon AWS EC2 or EMR with the Amazon AMI

白昼怎懂夜的黑 submitted on 2019-11-30 16:13:30
I have a need to run an application that requires a GUI interface to start and configure. I also need to be able to run this application on Amazon's EC2 service and EMR service. The EMR requirement means it has to run on Amazon's Linux AMI. After extensive searching I've been unable to find any ready-made solutions, in particular for the requirement to run on Amazon's AMI. The closest match and most often referenced solution is here. Unfortunately it was developed on a RHEL6 instance, which differs enough from Amazon's AMI that the solution does not work. I'm posting my solution below. Hopefully …

AWS Glue pricing against AWS EMR

拥有回忆 submitted on 2019-11-30 14:08:29
I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between them. I have considered 6 DPUs (4 vCPUs + 16 GB memory) with the ETL job running for 10 minutes a day for 30 days. Crawler requests are assumed to be 1 million above the free tier, which is calculated at $1 for the additional 1 million requests. On EMR I have considered m3.xlarge for both EC2 and EMR (priced at $0.266 and $0.070 per hour respectively) with 6 nodes, running for 10 minutes a day for 30 days. Calculating for a month, I see that AWS Glue works out to around $14.64, whereas EMR works out to around …
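
A rough back-of-the-envelope version of this comparison in Python, assuming AWS Glue's published $0.44 per DPU-hour rate and the per-hour instance prices quoted above; billing minimums and rounding (Glue's per-job minimum, EMR's billing granularity) can shift the exact totals, which is presumably why this sketch's Glue figure lands slightly below the $14.64 quoted above:

    # Illustrative cost arithmetic only; actual bills depend on region,
    # billing minimums, and rounding.
    GLUE_DPU_HOUR = 0.44      # published Glue rate in USD per DPU-hour (assumption)
    DPUS = 6
    RUN_HOURS = 10 / 60       # one 10-minute run per day
    DAYS = 30
    CRAWLER_EXTRA = 1.00      # 1M crawler requests above the free tier, per the question

    glue_total = DPUS * RUN_HOURS * DAYS * GLUE_DPU_HOUR + CRAWLER_EXTRA

    EC2_RATE = 0.266          # m3.xlarge EC2 price per hour, from the question
    EMR_RATE = 0.070          # m3.xlarge EMR surcharge per hour, from the question
    NODES = 6

    emr_total = NODES * (EC2_RATE + EMR_RATE) * RUN_HOURS * DAYS

    print(f"Glue: ~${glue_total:.2f}/month")   # about $14 under these assumptions
    print(f"EMR:  ~${emr_total:.2f}/month")    # about $10 under these assumptions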

How to launch and configure an EMR cluster using boto

*爱你&永不变心* submitted on 2019-11-30 03:26:26
I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job flows, but I can't for the life of me find an example that shows: how to define the cluster to be used (by cluster_id), or how to configure and launch a cluster (for example, if I want to use spot instances for some task nodes). Am I missing something? Boto and the underlying EMR API currently mix the terms cluster and job flow, and job flow is being deprecated. I consider them synonyms. You create a new cluster by calling the boto.emr.connection.run_jobflow() function. It will return the …
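
A minimal sketch of both halves of the question using the legacy boto 2 EMR API the answer refers to: launching a cluster with on-demand master/core nodes and spot task nodes via InstanceGroup, then reusing the returned id to add a step. The region, AMI version, key pair, bucket, and bid price below are placeholders, not values from the question:

    import boto.emr
    from boto.emr.instance_group import InstanceGroup
    from boto.emr.step import JarStep

    conn = boto.emr.connect_to_region('us-east-1')   # placeholder region

    # Mix on-demand master/core nodes with spot task nodes.
    instance_groups = [
        InstanceGroup(1, 'MASTER', 'm3.xlarge', 'ON_DEMAND', 'master'),
        InstanceGroup(2, 'CORE',   'm3.xlarge', 'ON_DEMAND', 'core'),
        InstanceGroup(4, 'TASK',   'm3.xlarge', 'SPOT', 'task-spot', bidprice='0.08'),
    ]

    # run_jobflow() returns the cluster / job flow id ("j-XXXXXXXXXXXX").
    cluster_id = conn.run_jobflow(
        name='example-cluster',
        log_uri='s3://my-bucket/emr-logs/',    # placeholder bucket
        ec2_keyname='my-keypair',              # placeholder key pair
        ami_version='3.11.0',                  # placeholder AMI version
        instance_groups=instance_groups,
        keep_alive=True,
        enable_debugging=True,
    )

    # Submit work to the running cluster later by referencing that id.
    step = JarStep(name='my-step',
                   jar='s3://my-bucket/jars/my-job.jar',   # placeholder jar
                   step_args=['arg1', 'arg2'])
    conn.add_jobflow_steps(cluster_id, [step])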

How to handle fields enclosed within quotes (CSV) when importing data from S3 into DynamoDB using EMR/Hive

泪湿孤枕 submitted on 2019-11-30 03:23:42
I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed within double quotes and separated by commas. While creating the external table in Hive, I am able to specify the delimiter as a comma, but how do I specify that fields are enclosed within quotes? If I don't specify it, I see that the values in DynamoDB are populated with the enclosing double quotes ("value"), which seems wrong. I am using the following command to create the external table. Is there a way to specify that fields are enclosed within double quotes? CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 …
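
The quoting is normally handled on the Hive side with a CSV SerDe, but if changing the table definition is not an option, one workaround is to strip the quotes in a preprocessing pass before the file lands in S3. A sketch of that alternative (not taken from the question or its answers), using Python's csv module, which understands quoted fields:

    import csv

    # Rewrite a quoted CSV into an unquoted one so that a plain
    # "FIELDS TERMINATED BY ','" external table reads clean values.
    # Assumes fields contain no embedded commas once the quotes are gone.
    with open('input_quoted.csv', newline='') as src, \
         open('output_plain.csv', 'w', newline='') as dst:
        reader = csv.reader(src)                                   # parses "..." quoting
        writer = csv.writer(dst, quoting=csv.QUOTE_NONE, escapechar='\\')
        for row in reader:
            writer.writerow(row)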

How do you make a HIVE table out of JSON data?

♀尐吖头ヾ submitted on 2019-11-29 19:44:39
I want to create a Hive table out of some nested JSON data and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to turn the JSON file into a Hive table. Does anyone have an example command to get me started? I can't find anything useful with Google. You'll need to use a JSON SerDe in order for Hive to map your JSON to the columns in your table. A really good example showing you how is here: http://aws.amazon.com/articles/2855 Unfortunately the JSON SerDe supplied …
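
If a JSON SerDe turns out to be awkward to install, an alternative (a sketch only, with made-up field names) is to flatten the nested JSON into a tab-delimited file that an ordinary Hive table can read directly:

    import json

    # Flatten newline-delimited JSON records into tab-separated lines.
    # The field names (user.id, user.name, score) are invented for illustration.
    with open('records.json') as src, open('records.tsv', 'w') as dst:
        for line in src:
            rec = json.loads(line)
            row = [str(rec['user']['id']), rec['user']['name'], str(rec['score'])]
            dst.write('\t'.join(row) + '\n')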

Hadoop: copying from HDFS to S3

末鹿安然 submitted on 2019-11-29 12:59:22
I've successfully completed a Mahout vectorizing job on Amazon EMR (using "Mahout on Elastic MapReduce" as a reference). Now I want to copy the results from HDFS to S3 (to use them in future clustering). For that I've used hadoop distcp: den@aws:~$ elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \ > --arg hdfs://my.bucket/prj1/seqfiles \ > --arg s3n://ACCESS_KEY:SECRET_KEY@my.bucket/prj1/seqfiles \ > -j $JOBID It failed. I found the suggestion to use s3distcp and tried that as well: elastic-mapreduce --jobflow $JOBID \ > --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp …
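
For what it's worth, the same s3distcp jar referenced above can also be submitted as a step from boto instead of the elastic-mapreduce CLI; a sketch with paths adapted from the question, leaving credentials out of the URI so the cluster's own configuration supplies them:

    import boto.emr
    from boto.emr.step import JarStep

    conn = boto.emr.connect_to_region('eu-west-1')   # assumed region, matching the jar URL above

    s3distcp = JarStep(
        name='hdfs-to-s3',
        jar='s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar',
        step_args=[
            '--src',  'hdfs:///prj1/seqfiles',
            '--dest', 's3n://my.bucket/prj1/seqfiles',
        ],
    )
    conn.add_jobflow_steps('j-XXXXXXXXXXXX', [s3distcp])   # id of the running job flow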

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

无人久伴 submitted on 2019-11-29 11:08:32
I would like to upgrade my AWS Data Pipeline definition to EMR 4.x or 5.x, so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP. The change from EMR 3.x to 4.x/5.x requires the use of releaseLabel in EmrCluster, versus amiVersion. When I use "releaseLabel": "emr-4.1.0", I get the following error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask Below is my data pipeline definition for EMR 3.x. It works well, so I hope others find this useful (including the answer for EMR 4.x/5.x), as the …
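
For reference, the amiVersion-to-releaseLabel swap the question describes looks roughly like this; shown here as Python dicts rather than actual Data Pipeline JSON, with illustrative values only:

    # EMR 3.x style EmrCluster object (the kind the working pipeline uses)
    emr_3x_cluster = {
        "id": "EmrClusterObj",
        "type": "EmrCluster",
        "amiVersion": "3.11.0",        # illustrative value
        "masterInstanceType": "m3.xlarge",
        "coreInstanceType": "m3.xlarge",
    }

    # EMR 4.x/5.x style: amiVersion is replaced by releaseLabel
    emr_4x_cluster = {
        "id": "EmrClusterObj",
        "type": "EmrCluster",
        "releaseLabel": "emr-4.1.0",   # the value from the question
        "masterInstanceType": "m3.xlarge",
        "coreInstanceType": "m3.xlarge",
    }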

Does an EMR master node know its cluster ID?

自作多情 submitted on 2019-11-29 03:07:48
I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify itself in its messages so that the recipient knows which cluster the message is about. Does the master node know its ID (j-*************)? If not, is there some other piece of identifying information that could allow the message recipient to infer this ID? I've taken a look through the config files in /home/hadoop/conf, and I haven't found …
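
One place worth checking (an assumption on my part, not something confirmed in the excerpt) is the metadata EMR writes onto each node, commonly /mnt/var/lib/info/job-flow.json, which carries a jobFlowId field; a small sketch of reading it:

    import json

    # EMR drops cluster metadata onto the node's local disk; on many AMI
    # versions the job flow / cluster id can be read from this file
    # (the path and key name are assumptions, so verify on a live node).
    with open('/mnt/var/lib/info/job-flow.json') as f:
        info = json.load(f)

    print(info['jobFlowId'])   # e.g. "j-XXXXXXXXXXXXX"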

Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

独自空忆成欢 submitted on 2019-11-28 20:36:15
I need to perform an initial upload of roughly 130 million items (5+ GB total) into a single DynamoDB table. After I faced problems uploading them via the API from my application, I decided to try EMR instead. Long story short, the import of that very average (for EMR) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress (about 20 minutes to process a 2 MB test chunk, and it didn't manage to finish a 700 MB test file in 12 hours). I have already contacted Amazon Premium Support, but so far they have only told me that "for some …
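
A quick sanity check on why such an import can crawl: DynamoDB write throughput, not EMR cluster size, usually bounds the job, since one write capacity unit sustains roughly one item write per second for items up to 1 KB, and the Hive connector only consumes a fraction of it (its dynamodb.throughput.write.percent setting defaults to 0.5). A sketch of the arithmetic with a made-up capacity figure:

    ITEMS = 130_000_000           # from the question
    WRITE_CAPACITY_UNITS = 1000   # made-up provisioned WCU; items assumed <= 1 KB
    WRITE_PERCENT = 0.5           # connector's default dynamodb.throughput.write.percent

    effective_writes_per_sec = WRITE_CAPACITY_UNITS * WRITE_PERCENT
    hours = ITEMS / effective_writes_per_sec / 3600
    print(f"~{hours:.0f} hours at {WRITE_CAPACITY_UNITS} WCU")   # ~72 hours in this example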