amazon-emr

Java SDK AWS EMR gives Failed to download error

我怕爱的太早我们不能终老 submitted on 2019-12-11 15:44:09
Question: If you follow https://docs.aws.amazon.com/emr/latest/ManagementGuide/calling-emr-with-java-sdk.html and you are not in us-east-1, then you'll get:

    2019-06-11T08:39:00.283Z INFO Ensure step 1 jar file s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
    INFO Failed to download: s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
    java.lang.RuntimeException: Error whilst fetching 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar'
        at aws157
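The error happens because the guide's sample step hard-codes the us-east-1 copy of script-runner.jar, which a cluster in another region cannot fetch. A common workaround is to reference a jar that is resolved on the cluster itself, such as command-runner.jar. The sketch below uses the AWS SDK for Java v1 (the SDK the linked guide uses); the region, cluster id, and step command are placeholders, not values from the question.

    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class RegionSafeStep {
        public static void main(String[] args) {
            // Build the EMR client in the same region as the cluster
            // (eu-west-1 is only an example; the cluster id below is hypothetical).
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
                    .withRegion(Regions.EU_WEST_1)
                    .build();

            // command-runner.jar is resolved locally on the cluster, so the step
            // does not depend on the us-east-1 elasticmapreduce bucket at all.
            HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("bash", "-c", "echo step-ran");

            StepConfig step = new StepConfig()
                    .withName("region-safe example step")
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(jarStep);

            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")
                    .withSteps(step));
        }
    }

Alternatively, the region-local copy of the jar (for example s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar) can be used instead of the us-east-1 path.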

Failed to create client - Spark as execution engine with Hive

浪尽此生 submitted on 2019-12-11 14:54:38
Question: I have a 32 GB single-node Amazon EMR cluster with Hive 2.3.4, Spark 2.4.2, and Hadoop 2.8.5 installed. I am trying to configure Spark as the execution engine for Hive. I have linked the Spark jar files into Hive via the following commands:

    sudo ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.2.jar
    sudo ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.2.jar
    sudo ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar

I have set the execution engine in the hive-site.xml file as well. I have added the
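For context on where the "Failed to create client" message tends to surface: besides hive-site.xml, the engine can be switched per session, which makes it easy to test Spark client creation from a single query. The sketch below is a minimal Hive JDBC session in Java; the HiveServer2 URL, user, and table name are assumptions, not values from the question.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveOnSparkSession {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC endpoint; host, port, database, and user are assumptions.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
                 Statement stmt = conn.createStatement()) {
                // Ask Hive to use Spark for this session; if Hive cannot create the
                // Spark client, the error from the question shows up on the next query.
                stmt.execute("SET hive.execution.engine=spark");
                try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM some_table")) {
                    while (rs.next()) {
                        System.out.println("row count: " + rs.getLong(1));
                    }
                }
            }
        }
    }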

AWS EMR run Spark job/step synchronously

試著忘記壹切 submitted on 2019-12-11 14:32:35
Question: Is it possible to run/submit a Spark step synchronously? I am trying to run a Spark step on an AWS EMR cluster from a Java app. I am trying to implement a service that runs the job and returns the results. I am exploring client-mode execution on AWS EMR, but I'm not sure if that's the right answer. Any guidance or links would be a great help. The high-level workflow for the API endpoint looks like this: /getMeStat -> add step to AWS EMR (EMR runs the job, generates the result) -> return results. Source: https:/
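There is no EMR API call that runs a step and blocks until it finishes, but the same effect can be had by adding the step and then polling DescribeStep until it leaves the PENDING/RUNNING states. A minimal sketch with the AWS SDK for Java v1 follows; the cluster id, application jar, and class name are placeholders.

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.*;

    public class SynchronousSparkStep {
        public static void main(String[] args) throws InterruptedException {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // spark-submit via command-runner.jar; the application jar and class are hypothetical.
            HadoopJarStepConfig sparkSubmit = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("spark-submit", "--deploy-mode", "cluster",
                              "--class", "com.example.StatsJob",
                              "s3://myclusterbucket/jars/stats-job.jar");

            StepConfig step = new StepConfig()
                    .withName("getMeStat")
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(sparkSubmit);

            AddJobFlowStepsResult added = emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")
                    .withSteps(step));
            String stepId = added.getStepIds().get(0);

            // Block until the step leaves the PENDING/RUNNING states.
            while (true) {
                StepStatus status = emr.describeStep(new DescribeStepRequest()
                        .withClusterId("j-XXXXXXXXXXXXX")
                        .withStepId(stepId))
                        .getStep().getStatus();
                String state = status.getState();
                if (!"PENDING".equals(state) && !"RUNNING".equals(state)) {
                    System.out.println("step finished in state " + state);
                    break;
                }
                Thread.sleep(30_000);
            }
        }
    }

Note that the EMR API only reports step state; the job's actual results still have to be written somewhere the calling service can read them afterwards, for example an S3 prefix.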

How to convert .NET DateTime.Ticks to Hive DateTime in a query?

泪湿孤枕 submitted on 2019-12-11 13:52:07
Question: I have a log file with a column in DateTime.Ticks (635677577653488758), which I am trying to convert to a date in Hadoop Hive. First I tried the code block below on MySQL and it worked, but the same code didn't work in Hive because the date_add function works with INT:

    SELECT DATE_ADD('2001-01-01 00:00:00', INTERVAL (MAX(f.date) - 631139040000000000)/10 MICROSECOND);

Then I will format it like this:

    SELECT DATE_FORMAT(MyDateFromTicks, '%Y-%m-%dT%T.%fZ');

How can I achieve this? Thank you. Answer 1: I
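For the underlying arithmetic: a .NET tick is 100 nanoseconds counted from 0001-01-01, and the Unix epoch falls at 621355968000000000 ticks, so a tick value converts to epoch milliseconds as (ticks - 621355968000000000) / 10000. The Java sketch below only illustrates that conversion, using the sample value from the question.

    import java.time.Instant;

    public class TicksToTimestamp {
        // .NET DateTime.Ticks value at the Unix epoch (1970-01-01T00:00:00Z).
        private static final long UNIX_EPOCH_TICKS = 621_355_968_000_000_000L;
        // A tick is 100 ns, so there are 10,000 ticks per millisecond.
        private static final long TICKS_PER_MILLI = 10_000L;

        public static Instant fromTicks(long ticks) {
            return Instant.ofEpochMilli((ticks - UNIX_EPOCH_TICKS) / TICKS_PER_MILLI);
        }

        public static void main(String[] args) {
            // Sample value from the question; prints a timestamp in May 2015.
            System.out.println(fromTicks(635677577653488758L));
        }
    }

In HiveQL the same constants could be applied with something like from_unixtime(cast((ticks - 621355968000000000) / 10000000 as bigint)), though that exact expression is an assumption, not part of the question or its answer.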

Query results difference between EMR-Presto and Athena

感情迁移 submitted on 2019-12-11 13:05:04
Question: I have connected the Glue catalog to Athena and to an EMR instance (with Presto installed). I tried running the same query on both but am getting different results: EMR gives 0 rows while Athena gives 43 rows. The query is pretty simple, with a left join, a group by, and a count distinct. The query looks like this:

    select t1.customer_id as id,
           t2.purchase_date as purchase_date,
           count(distinct t1.purchase_id) as item_count
    from table1 t1
    left join table2 as t2 on t2.purchase_id=t1.purchase_id

Slow or incomplete saveAsParquetFile from EMR Spark to S3

若如初见. submitted on 2019-12-11 12:18:54
Question: I have a piece of code that creates a DataFrame and persists it to S3. The snippet below creates a DataFrame of 1000 rows and 100 columns, populated by math.Random. I'm running this on a cluster with 4 x r3.8xlarge worker nodes and have configured plenty of memory. I've tried with the maximum number of executors, and with one executor per node.

    // create some random data for performance and scalability testing
    val df = sqlContext.range(0,1000).map(x => Row.fromSeq((1 to 100).map(y => math.Random)))
    df
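For reference, here is a roughly equivalent workload expressed with the Java Spark API; the bucket path and application name are placeholders, and the rows are built on the driver purely to keep the sketch short.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class RandomParquetToS3 {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("parquet-perf-test").getOrCreate();

            // 100 double columns named c0..c99.
            List<StructField> fields = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                fields.add(DataTypes.createStructField("c" + i, DataTypes.DoubleType, false));
            }
            StructType schema = DataTypes.createStructType(fields);

            // 1000 rows of random doubles.
            Random rnd = new Random();
            List<Row> rows = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                Object[] values = new Object[100];
                for (int j = 0; j < 100; j++) {
                    values[j] = rnd.nextDouble();
                }
                rows.add(RowFactory.create(values));
            }

            Dataset<Row> df = spark.createDataFrame(rows, schema);
            df.write().mode("overwrite").parquet("s3://my-bucket/perf-test/");
        }
    }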

Problems using distcp and s3distcp with my EMR job that outputs to HDFS

老子叫甜甜 submitted on 2019-12-11 08:12:13
Question: I've run a job on AWS EMR and stored the output in the EMR job's HDFS. I am then trying to copy the result to S3 via distcp or s3distcp, but both are failing as described below. (Note: the reason I'm not just sending my EMR job's output directly to S3 is the (currently unresolved) problem I describe in "Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?") For distcp, I run (following this post's recommendation):

    elastic-mapreduce --jobflow <MY
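On release 4.x and later EMR clusters, s3-dist-cp is normally run as a cluster step rather than from the old elastic-mapreduce CLI. The sketch below adds such a step with the AWS SDK for Java v1; the cluster id, HDFS source path, and destination bucket are placeholders, not values from the question.

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class CopyHdfsOutputToS3 {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // s3-dist-cp ships with EMR and is invoked through command-runner.jar;
            // the source and destination paths here are placeholders.
            HadoopJarStepConfig s3DistCp = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("s3-dist-cp",
                              "--src", "hdfs:///user/hadoop/job-output",
                              "--dest", "s3://my-bucket/job-output");

            StepConfig step = new StepConfig()
                    .withName("copy HDFS output to S3")
                    .withActionOnFailure("CONTINUE")
                    .withHadoopJarStep(s3DistCp);

            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX")
                    .withSteps(step));
        }
    }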

Using s3 as fs.default.name or HDFS?

元气小坏坏 submitted on 2019-12-11 07:54:42
Question: I'm setting up a Hadoop cluster on EC2 and I'm wondering how to do the DFS. All my data is currently in S3, and all map/reduce applications use S3 file paths to access the data. Now I've been looking at how Amazon's EMR is set up, and it appears that for each job flow, a namenode and datanodes are set up. Now I'm wondering if I really need to do it that way, or if I could just use s3(n) as the DFS? If doing so, are there any drawbacks? Thanks! Answer 1: In order to use S3 instead of HDFS, fs.default.name
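The property the answer is reaching for normally lives in core-site.xml. The fragment below shows the same idea applied programmatically through the Hadoop client API; the bucket name, the credential placeholders, and the legacy s3n scheme (which matches the era of the question) are assumptions.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3AsDefaultFs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Make the bucket the default filesystem instead of HDFS; on newer
            // Hadoop the key is fs.defaultFS, with fs.default.name kept as the
            // deprecated alias the original answer refers to.
            conf.set("fs.defaultFS", "s3n://my-bucket");
            conf.set("fs.s3n.awsAccessKeyId", "<access-key>");
            conf.set("fs.s3n.awsSecretAccessKey", "<secret-key>");

            FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket"), conf);
            for (FileStatus status : fs.listStatus(new Path("/input"))) {
                System.out.println(status.getPath());
            }
        }
    }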

How does EMR handle an s3 bucket for input and output?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-11 07:33:43
Question: I'm spinning up an EMR cluster and I've created the buckets specified in the EMR docs, but how should I upload data and read from it? In my spark-submit step I specify the script using s3://myclusterbucket/scripts/script.py. Is the output not automatically uploaded to S3? How are dependencies handled? I've tried pointing --py-files at a dependency zip inside the S3 bucket, but I keep getting back 'file not found'. Answer 1: MapReduce or Tez jobs in EMR can access S3 directly because of EMRFS (an

Huge delays translating the DAG to tasks

ε祈祈猫儿з submitted on 2019-12-11 07:27:24
Question: These are my steps:

1. Submit the Spark app to an EMR cluster.
2. The driver starts and I can see the Spark UI (no stages have been created yet).
3. The driver reads an ORC file with ~3000 parts from S3, makes some transformations, and saves it back to S3.
4. The execution of the save should create some stages in the Spark UI, but the stages take a really long time to appear.
5. The stages appear and start executing.

Why am I getting that huge delay in step 4? During this time the cluster is