amazon-emr

Cannot connect/query from Presto on AWS EMR with Java JDBC

南笙酒味 submitted on 2021-01-28 12:22:20
Question: If I SSH onto the master node of my Presto EMR cluster, I can run queries. However, I would like to run queries from Java source code on my local machine that connects to the EMR cluster. I set up my Presto EMR cluster with default configurations. I have tried port forwarding, but it still does not seem to work. When I create the connection and print it out, it shows "com.facebook.presto.jdbc.PrestoConnection@XXXXXXX", but I still doubt it is actually connected, since I can't …
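A PrestoConnection object is created lazily, so printing it proves nothing: the connection is only exercised when a statement actually runs. Below is a minimal connectivity check, sketched under assumptions. The asker uses Java JDBC, but the same round trip can be made with the presto-python-client package; the host, user, catalog, and the default EMR Presto port 8889 are all placeholders for this setup.

import prestodb

# Hedged sketch: force a real round trip to the coordinator. Assumes an SSH
# tunnel (e.g. ssh -L 8889:localhost:8889 hadoop@<master-dns>) or a security
# group rule that allows inbound traffic on the Presto port.
conn = prestodb.dbapi.connect(
    host="localhost",   # tunnel endpoint, or the master's public DNS
    port=8889,          # EMR's usual Presto coordinator port (assumption)
    user="hadoop",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT 1")   # unlike printing the connection, this must reach the cluster
print(cur.fetchall())

If this times out or is refused, the problem is networking (the tunnel or the master's security group), not the JDBC code.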

On AWS, run an AWS CLI command daily

好久不见. submitted on 2021-01-28 10:51:39
Question: I have an AWS CLI invocation (in this case, to launch a configured EMR cluster, run some steps, and then shut down), but I'm not sure how to go about running it daily. I guess one way to do it is an EC2 micro instance running a cron job, or an ECS task in a micro instance that launches the command, but that all seems like it might be overkill. It looks like there's also a way to do it in Lambda, but from what I can tell it'd be kludgy. This doesn't have to be a good long-term solution, something that's …
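One serverless pattern that avoids both the EC2 cron box and the ECS task: an EventBridge (CloudWatch Events) schedule rule triggers a Lambda function that makes the same call through boto3 instead of the CLI. A hedged sketch, with every cluster parameter below a placeholder:

import boto3

# Hypothetical Lambda handler: launch a transient EMR cluster that runs one
# step and shuts itself down, mirroring an `aws emr create-cluster` call.
def handler(event, context):
    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="nightly-job",
        ReleaseLabel="emr-5.29.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the steps
        },
        Steps=[{
            "Name": "my-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/job.py"],  # placeholder job
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

A schedule expression such as cron(0 6 * * ? *) on the rule then runs it daily with no servers to maintain.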

PySpark UDF optimization challenge

◇◆丶佛笑我妖孽 submitted on 2021-01-27 15:01:14
Question: I am trying to optimize the code below. When run with 1000 rows of data, it takes about 12 minutes to complete. Our use case would require data sizes of around 25K - 50K rows, which would make this implementation completely infeasible.

import pyspark.sql.types as Types
import numpy
import spacy
from pyspark.sql.functions import udf

inputPath = "s3://myData/part-*.parquet"
df = spark.read.parquet(inputPath)
test_df = df.select('uid', 'content').limit(1000).repartition(10)
# print(df.rdd …
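The excerpt cuts off before the UDF body, but with spaCy the usual cost sink is loading the language model per row or per call. A hedged sketch of the standard fix, caching the pipeline once per executor process; the model name, return type, and column names are assumptions carried over from the excerpt:

import pyspark.sql.functions as F
import pyspark.sql.types as T

_nlp = None  # module-level cache so each Python worker loads spaCy once

def _get_nlp():
    global _nlp
    if _nlp is None:
        import spacy
        _nlp = spacy.load("en_core_web_sm")  # assumed model
    return _nlp

@F.udf(T.ArrayType(T.StringType()))
def tokenize(text):
    return [t.text for t in _get_nlp()(text or "")]

result = test_df.withColumn("tokens", tokenize("content"))

For the 25K - 50K row target, a pandas_udf that feeds whole batches through nlp.pipe() is usually the next step, since it also amortizes the Python-JVM serialization cost.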

Spark no such field METASTORE_CLIENT_FACTORY_CLASS

瘦欲@ submitted on 2021-01-04 07:02:45
Question: I am trying to query a Hive table using Spark in Java. My Hive tables are in an EMR 5.12 cluster. The Spark version is 2.2.1 and Hive is 2.3.2. When I SSH into the machine and connect to the spark-shell, I am able to query the Hive tables with no issues. But when I try to query using a custom jar, I get the following exception:

java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark …
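The "no such field METASTORE_CLIENT_FACTORY_CLASS" symptom usually points at a classpath clash rather than the query itself: EMR ships a patched Hive client whose HiveConf carries that extra field, and a fat jar that bundles stock Hive or Spark classes can shadow it. This is an interpretation, not something the excerpt confirms. The sketch below is the PySpark shape of the same job; the point is that the application only requests Hive support and lets the cluster supply the client classes:

from pyspark.sql import SparkSession

# Hedged sketch: rely on the cluster's own Hive configuration and jars.
# Database and table names are placeholders.
spark = (
    SparkSession.builder
    .appName("hive-query")
    .enableHiveSupport()   # picks up the cluster's hive-site.xml
    .getOrCreate()
)
spark.sql("SELECT * FROM mydb.mytable LIMIT 10").show()

In a Java/Maven build the equivalent move is marking spark-sql, spark-hive, and any Hive artifacts as provided scope so they are not shaded into the custom jar.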

Spark 2.2.0 - How to write/read DataFrame to DynamoDB

本秂侑毒 submitted on 2020-12-29 06:23:53
Question: I want my Spark application to read a table from DynamoDB, do stuff, then write the result to DynamoDB. Read the table into a DataFrame: right now, I can read the table from DynamoDB into Spark as a hadoopRDD and convert it to a DataFrame. However, I had to use a regular expression to extract the value from AttributeValue. Is there a better/more elegant way? I couldn't find anything in the AWS API.

package main.scala.util
import org.apache.spark.sql.SparkSession
import org.apache.spark …
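On the write side, one connector-free option is to push each partition through boto3's batch writer. A hedged sketch (the asker's code is Scala; this is the PySpark shape of the same idea, and the table name, region, and a clean column-to-attribute mapping are all assumptions):

import boto3

# Hypothetical sketch: df is the DataFrame to persist. batch_writer() buffers
# rows into 25-item BatchWriteItem calls and retries unprocessed items.
# Note: DynamoDB rejects Python floats, so numeric columns may need
# converting to decimal.Decimal first.
def write_partition(rows):
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("MyTable")
    with table.batch_writer() as writer:
        for row in rows:
            writer.put_item(Item=row.asDict())

df.foreachPartition(write_partition)

Creating the resource inside the function matters: boto3 clients are not serializable, so they must be built on the executors rather than captured in the closure.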

S3Guard and Parquet magic committer for S3A on EMR 6.x

六眼飞鱼酱① submitted on 2020-12-27 07:12:39
Question: We are using CDH 5.13 with Spark 2.3.0 and S3Guard. After running the same job on EMR 5.x / 6.x with the same resources, we saw a 5-20x performance degradation. According to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html, the default committer (since EMR 5.20) is not used for S3A. We tested EMR 5.15.1 and got the same results as on Hadoop. If I try to use the magic committer, I get:

py4j.protocol.Py4JJavaError: An error occurred while calling o72.save. : java …
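For reference, the Hadoop-documented settings for the magic committer look like the sketch below. Two hedged caveats that may explain the Py4JJavaError: the binding classes live in the spark-hadoop-cloud module, which is not on the classpath by default, and on EMR these settings only affect s3a:// URIs, since EMR's EMRFS S3-optimized committer covers s3:// paths. The bucket path is a placeholder.

from pyspark.sql import SparkSession

# Hedged sketch: S3A magic committer wiring per the Hadoop 3.x documentation.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)
spark.range(10).write.mode("overwrite").parquet("s3a://my-bucket/out/")  # placeholder path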
