amazon-emr

Java SDK AWS EMR gives Failed to download error

我怕爱的太早我们不能终老 submitted on 2019-12-11 15:44:09
Question: If you follow https://docs.aws.amazon.com/emr/latest/ManagementGuide/calling-emr-with-java-sdk.html and you are not in us-east-1, then you'll get:

    2019-06-11T08:39:00.283Z INFO Ensure step 1 jar file s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
    INFO Failed to download: s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
    java.lang.RuntimeException: Error whilst fetching 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar'
        at aws157
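The error happens because the guide's sample step hard-codes the us-east-1 copy of script-runner.jar, which a cluster in another region cannot fetch. A common workaround is to reference a jar that is resolved on the cluster itself, such as command-runner.jar. The sketch below uses the AWS SDK for Java v1 (the SDK the linked guide uses); the region, cluster id, and step command are placeholders, not values from the question.

    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class RegionSafeStep {
        public static void main(String[] args) {
            // Build the EMR client in the same region as the cluster
            // (eu-west-1 is only an example; the cluster id below is hypothetical).
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
                    .withRegion(Regions.EU_WEST_1)
                    .build();

            // command-runner.jar is resolved locally on the cluster, so the step
            // does not depend on the us-east-1 elasticmapreduce bucket at all.
            HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("bash", "-c", "echo step-ran");

            StepConfig step = new StepConfig()
                    .withName("region-safe example step")
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(jarStep);

            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")
                    .withSteps(step));
        }
    }

Alternatively, the region-local copy of the jar (for example s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar) can be used instead of the us-east-1 path.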

Failed to create client - Spark as execution engine with Hive

浪尽此生 submitted on 2019-12-11 14:54:38
Question: I have a 32 GB single-node Amazon EMR cluster with Hive 2.3.4, Spark 2.4.2, and Hadoop 2.8.5 installed. I am trying to configure Spark as the execution engine for Hive. I have linked the Spark jar files into Hive via the following commands:

    sudo ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.2.jar
    sudo ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.2.jar
    sudo ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar

I have set the execution engine in the hive-site.xml file as well. I have added the
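For context on where the "Failed to create client" message tends to surface: besides hive-site.xml, the engine can be switched per session, which makes it easy to test Spark client creation from a single query. The sketch below is a minimal Hive JDBC session in Java; the HiveServer2 URL, user, and table name are assumptions, not values from the question.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveOnSparkSession {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC endpoint; host, port, database, and user are assumptions.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
                 Statement stmt = conn.createStatement()) {
                // Ask Hive to use Spark for this session; if Hive cannot create the
                // Spark client, the error from the question shows up on the next query.
                stmt.execute("SET hive.execution.engine=spark");
                try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM some_table")) {
                    while (rs.next()) {
                        System.out.println("row count: " + rs.getLong(1));
                    }
                }
            }
        }
    }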

AWS EMR run Spark job/step synchronously

試著忘記壹切 submitted on 2019-12-11 14:32:35
Question: Is it possible to run/submit a Spark step synchronously? I am trying to run a Spark step on an AWS EMR cluster from a Java app. I am trying to implement a service that runs the job and returns the results. I am exploring client-mode execution on AWS EMR, but I'm not sure if that's the right answer. Any guidance or links would be a great help. The high-level workflow for the API endpoint looks like this: /getMeStat -> add step to AWS EMR (EMR runs the job, generates the result) -> return results. Source: https:/
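There is no EMR API call that runs a step and blocks until it finishes, but the same effect can be had by adding the step and then polling DescribeStep until it leaves the PENDING/RUNNING states. A minimal sketch with the AWS SDK for Java v1 follows; the cluster id, application jar, and class name are placeholders.

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.*;

    public class SynchronousSparkStep {
        public static void main(String[] args) throws InterruptedException {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // spark-submit via command-runner.jar; the application jar and class are hypothetical.
            HadoopJarStepConfig sparkSubmit = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("spark-submit", "--deploy-mode", "cluster",
                              "--class", "com.example.StatsJob",
                              "s3://myclusterbucket/jars/stats-job.jar");

            StepConfig step = new StepConfig()
                    .withName("getMeStat")
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(sparkSubmit);

            AddJobFlowStepsResult added = emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")
                    .withSteps(step));
            String stepId = added.getStepIds().get(0);

            // Block until the step leaves the PENDING/RUNNING states.
            while (true) {
                StepStatus status = emr.describeStep(new DescribeStepRequest()
                        .withClusterId("j-XXXXXXXXXXXXX")
                        .withStepId(stepId))
                        .getStep().getStatus();
                String state = status.getState();
                if (!"PENDING".equals(state) && !"RUNNING".equals(state)) {
                    System.out.println("step finished in state " + state);
                    break;
                }
                Thread.sleep(30_000);
            }
        }
    }

Note that the EMR API only reports step state; the job's actual results still have to be written somewhere the calling service can read them afterwards, for example an S3 prefix.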

How to convert .NET DateTime.Ticks to Hive DateTime in a query?

泪湿孤枕 submitted on 2019-12-11 13:52:07
Question: I have a log file with a column in DateTime.Ticks (635677577653488758), which I am trying to convert to a date in Hadoop Hive. First I tried the code block below on MySQL and it worked, but the same code didn't work in Hive because the date_add function works with INT:

    SELECT DATE_ADD('2001-01-01 00:00:00', INTERVAL (MAX(f.date) - 631139040000000000)/10 MICROSECOND);

Then I will format it like this:

    SELECT DATE_FORMAT(MyDateFromTicks, '%Y-%m-%dT%T.%fZ');

How can I achieve this? Thank you. Answer 1: I
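For the underlying arithmetic: a .NET tick is 100 nanoseconds counted from 0001-01-01, and the Unix epoch falls at 621355968000000000 ticks, so a tick value converts to epoch milliseconds as (ticks - 621355968000000000) / 10000. The Java sketch below only illustrates that conversion, using the sample value from the question.

    import java.time.Instant;

    public class TicksToTimestamp {
        // .NET DateTime.Ticks value at the Unix epoch (1970-01-01T00:00:00Z).
        private static final long UNIX_EPOCH_TICKS = 621_355_968_000_000_000L;
        // A tick is 100 ns, so there are 10,000 ticks per millisecond.
        private static final long TICKS_PER_MILLI = 10_000L;

        public static Instant fromTicks(long ticks) {
            return Instant.ofEpochMilli((ticks - UNIX_EPOCH_TICKS) / TICKS_PER_MILLI);
        }

        public static void main(String[] args) {
            // Sample value from the question; prints a timestamp in May 2015.
            System.out.println(fromTicks(635677577653488758L));
        }
    }

In HiveQL the same constants could be applied with something like from_unixtime(cast((ticks - 621355968000000000) / 10000000 as bigint)), though that exact expression is an assumption, not part of the question or its answer.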

Query results difference between EMR-Presto and Athena

感情迁移 submitted on 2019-12-11 13:05:04
Question: I have connected the Glue catalog to Athena and to an EMR instance (with Presto installed). I tried running the same query on both but am getting different results: EMR gives 0 rows while Athena gives 43 rows. The query is pretty simple, with a left join, a group by, and a count distinct. The query looks like this:

    select t1.customer_id as id,
           t2.purchase_date as purchase_date,
           count(distinct t1.purchase_id) as item_count
    from table1 t1
    left join table2 as t2 on t2.purchase_id=t1.purchase_id

Slow or incomplete saveAsParquetFile from EMR Spark to S3

若如初见. submitted on 2019-12-11 12:18:54
Question: I have a piece of code that creates a DataFrame and persists it to S3. The snippet below creates a DataFrame of 1000 rows and 100 columns, populated by math.Random. I'm running this on a cluster with 4 x r3.8xlarge worker nodes and have configured plenty of memory. I've tried with the maximum number of executors, and with one executor per node.

    // create some random data for performance and scalability testing
    val df = sqlContext.range(0,1000).map(x => Row.fromSeq((1 to 100).map(y => math.Random)))
    df
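For reference, here is a roughly equivalent workload expressed with the Java Spark API; the bucket path and application name are placeholders, and the rows are built on the driver purely to keep the sketch short.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class RandomParquetToS3 {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("parquet-perf-test").getOrCreate();

            // 100 double columns named c0..c99.
            List<StructField> fields = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                fields.add(DataTypes.createStructField("c" + i, DataTypes.DoubleType, false));
            }
            StructType schema = DataTypes.createStructType(fields);

            // 1000 rows of random doubles.
            Random rnd = new Random();
            List<Row> rows = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                Object[] values = new Object[100];
                for (int j = 0; j < 100; j++) {
                    values[j] = rnd.nextDouble();
                }
                rows.add(RowFactory.create(values));
            }

            Dataset<Row> df = spark.createDataFrame(rows, schema);
            df.write().mode("overwrite").parquet("s3://my-bucket/perf-test/");
        }
    }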

Problems using distcp and s3distcp with my EMR job that outputs to HDFS

老子叫甜甜 submitted on 2019-12-11 08:12:13
Question: I've run a job on AWS EMR and stored the output in the EMR job's HDFS. I am then trying to copy the result to S3 via distcp or s3distcp, but both are failing as described below. (Note: the reason I'm not just sending my EMR job's output directly to S3 is the (currently unresolved) problem I describe in "Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?") For distcp, I run (following this post's recommendation):

    elastic-mapreduce --jobflow <MY
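On release 4.x and later EMR clusters, s3-dist-cp is normally run as a cluster step rather than from the old elastic-mapreduce CLI. The sketch below adds such a step with the AWS SDK for Java v1; the cluster id, HDFS source path, and destination bucket are placeholders, not values from the question.

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class CopyHdfsOutputToS3 {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // s3-dist-cp ships with EMR and is invoked through command-runner.jar;
            // the source and destination paths here are placeholders.
            HadoopJarStepConfig s3DistCp = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("s3-dist-cp",
                              "--src", "hdfs:///user/hadoop/job-output",
                              "--dest", "s3://my-bucket/job-output");

            StepConfig step = new StepConfig()
                    .withName("copy HDFS output to S3")
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(s3DistCp);

            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")
                    .withSteps(step));
        }
    }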

Using s3 as fs.default.name or HDFS?

元气小坏坏 submitted on 2019-12-11 07:54:42
Question: I'm setting up a Hadoop cluster on EC2 and I'm wondering how to do the DFS. All my data is currently in S3, and all map/reduce applications use S3 file paths to access the data. Now I've been looking at how Amazon's EMR is set up, and it appears that for each job flow, a namenode and datanodes are set up. Now I'm wondering if I really need to do it that way, or if I could just use s3(n) as the DFS? If doing so, are there any drawbacks? Thanks! Answer 1: In order to use S3 instead of HDFS, fs.default.name
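The property the answer is reaching for normally lives in core-site.xml. The fragment below shows the same idea applied programmatically through the Hadoop client API; the bucket name, the credential placeholders, and the legacy s3n scheme (which matches the era of the question) are assumptions.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3AsDefaultFs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Make the bucket the default filesystem instead of HDFS; on newer
            // Hadoop the key is fs.defaultFS, with fs.default.name kept as the
            // deprecated alias the original answer refers to.
            conf.set("fs.defaultFS", "s3n://my-bucket");
            conf.set("fs.s3n.awsAccessKeyId", "<access-key>");
            conf.set("fs.s3n.awsSecretAccessKey", "<secret-key>");

            FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket"), conf);
            for (FileStatus status : fs.listStatus(new Path("/input"))) {
                System.out.println(status.getPath());
            }
        }
    }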

How does EMR handle an s3 bucket for input and output?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-11 07:33:43
Question: I'm spinning up an EMR cluster and I've created the buckets specified in the EMR docs, but how should I upload data and read from it? In my spark-submit step I specify the script using s3://myclusterbucket/scripts/script.py. Is the output not automatically uploaded to S3? How are dependencies handled? I've tried pointing --py-files at a dependency zip inside the S3 bucket, but I keep getting back 'file not found'. Answer 1: MapReduce or Tez jobs in EMR can access S3 directly because of EMRFS (an

Huge delays translating the DAG to tasks

ε祈祈猫儿з submitted on 2019-12-11 07:27:24
Question: These are my steps:

1. Submit the Spark app to an EMR cluster.
2. The driver starts and I can see the Spark UI (no stages have been created yet).
3. The driver reads an ORC file with ~3000 parts from S3, makes some transformations, and saves it back to S3.
4. The execution of the save should create some stages in the Spark UI, but the stages take a really long time to appear.
5. The stages appear and start executing.

Why am I getting that huge delay in step 4? During this time the cluster is