amazon-emr

Regarding org.apache.spark.sql.AnalysisException error when creating a jar file using Scala

Submitted by 笑着哭i on 2021-02-17 05:33:34
Problem: I have the following simple Scala class, which I will later modify to fit some machine learning models. I need to create a jar file from it, as I am going to run these models on amazon-emr. I am a beginner in this process, so I first tested whether I can successfully import the following CSV file and write it to another file by creating a jar from the Scala class mentioned below. The CSV file looks like this, and it includes a Date column as one of the variables: +-------------------+------- …
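
For orientation, here is a minimal sketch of the read-then-write the question describes. The question's class is Scala, but the DataFrame API calls are nearly identical; the sketch below is in PySpark to match the other snippets on this page, and all paths and options are placeholders, not taken from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-roundtrip").getOrCreate()

# Read a CSV whose columns include a Date field; with inferSchema the column
# may come back as string unless a dateFormat option or explicit schema is set.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/input/data.csv"))  # hypothetical path

# Write the frame back out. An AnalysisException around this point usually
# means a path, schema, or column name did not resolve as expected.
df.write.mode("overwrite").csv("s3://my-bucket/output/")  # hypothetical path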

Reading from S3 in EMR

Submitted by 萝らか妹 on 2021-02-11 15:23:17
Problem: I'm having trouble reading CSV files stored in my bucket on AWS S3 from EMR. I have read quite a few posts about it and have done the following to make it work: added an IAM policy allowing read and write access to S3, and tried to pass the URIs in the Arguments section of the spark-submit request. I thought querying S3 from EMR on a common account was straightforward (because it works locally after defining a fileSystem and providing AWS credentials), but when I run: df = spark.read.option( …
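
The read being attempted looks like the sketch below (bucket and options are placeholders). Worth noting: on EMR, EMRFS serves s3:// URIs using the cluster's instance-profile role, so, unlike a local run, no explicit fileSystem or credential configuration should be needed if the IAM policy is attached to the EMR roles.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

# EMRFS resolves s3:// through the cluster's IAM role; no access keys needed.
df = (spark.read
      .option("header", "true")
      .option("sep", ",")
      .csv("s3://my-bucket/path/to/file.csv"))  # hypothetical URI

df.show(5)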

Pyspark EMR Conda issue

Submitted by 梦想与她 on 2021-02-11 14:46:28
Problem: I am trying to run a Spark script on EMR with a custom conda env. I created a bootstrap script for the conda setup and supplied it to EMR; I don't see any issues with the bootstrap, but when I do spark-submit it gives me the same error, and I'm not sure what I am missing. Traceback (most recent call last): File "/mnt/tmp/spark-b334133c-d22d-42d4-beba-b85fffbbc9c7/iris_cube_analysis.py", line 3, in <module> import iris ImportError: No module named iris The spark-submit command: spark-submit --deploy-mode client --master yarn --conf …
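
A frequent cause of this ImportError is that the YARN containers run the system Python instead of the bootstrap-installed conda interpreter. A hedged sketch of one fix, assuming the conda env lives at /home/hadoop/conda (a placeholder path):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Point both the executors and the YARN application master at the conda
# interpreter; otherwise they fall back to the cluster's default Python.
conf = (SparkConf()
        .set("spark.pyspark.python", "/home/hadoop/conda/bin/python")
        .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON",
             "/home/hadoop/conda/bin/python"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()

The same settings can be passed as --conf flags to spark-submit; in client deploy mode the driver's own interpreter is additionally governed by the PYSPARK_PYTHON environment variable at submit time.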

Unable to locate hive jars to connect to metastore while using a PySpark job to connect to Athena tables

Submitted by 坚强是说给别人听的谎言 on 2021-02-11 13:19:44
Problem: We are using a SageMaker instance to connect to EMR in AWS. We have some PySpark scripts that unload Athena tables and process them as part of a pipeline. We access the Athena tables through the Glue catalog, but when we try to run the job via spark-submit, our job fails. Code snippet: from pyspark import SparkContext, SparkConf from pyspark.context import SparkContext from pyspark.sql import Row, SQLContext, SparkSession import pyspark.sql.dataframe def process_data(): conf = SparkConf() …
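
A hedged sketch of how such a session is typically built so Spark can see the Glue catalog: the builder enables Hive support and points the metastore client at EMR's Glue factory class. The conf keys and table name below are assumptions to verify against the cluster's setup; "unable to locate hive jars" often indicates the session was created without Hive support on the classpath.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("athena-tables")
         .config("spark.sql.catalogImplementation", "hive")
         .config("hive.metastore.client.factory.class",
                 "com.amazonaws.glue.catalog.metastore."
                 "AWSGlueDataCatalogHiveClientFactory")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT * FROM my_db.my_table LIMIT 10")  # hypothetical table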

RDS to S3 - Data Transformation AWS

Submitted by 半城伤御伤魂 on 2021-02-11 12:31:32
Problem: I have about 30 tables in my RDS Postgres/Oracle instance (I haven't decided between Oracle and Postgres yet). I want to fetch all the records that have been inserted or updated in the last 4 hours (configurable), create a CSV file for each table, and store the files in S3. I want this whole process to be transactional: if there is any error fetching data from one table, I don't want data pertaining to the other 29 tables to be persisted in S3. The data isn't very large; it should be …
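
S3 itself offers no multi-object transactions, so one common pattern for the all-or-nothing requirement is staging: write every table's CSV under a temporary prefix and promote the whole batch to the final prefix only after all 30 extracts succeed. A minimal sketch, assuming hypothetical bucket and prefix names:

import boto3

s3 = boto3.client("s3")
BUCKET, STAGE, FINAL = "my-bucket", "staging/", "exports/"  # placeholders

def promote_staged_files(keys):
    # Copy each staged object to the final prefix, then remove the staged copy.
    for key in keys:
        dest = FINAL + key[len(STAGE):]
        s3.copy_object(Bucket=BUCKET,
                       CopySource={"Bucket": BUCKET, "Key": key},
                       Key=dest)
        s3.delete_object(Bucket=BUCKET, Key=key)

# Usage idea: upload each table's CSV under STAGE inside a try block; on any
# failure, delete the staged keys instead of calling promote_staged_files.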

How can I configure spark so that it creates “_$folder$” entries in S3?

Submitted by 青春壹個敷衍的年華 on 2021-02-10 14:39:47
Problem: When I write my DataFrame to S3 using df.write.format("parquet").mode("overwrite").partitionBy("year", "month", "day", "hour", "gen", "client").option("compression", "gzip").save("s3://xxxx/yyyy"), I get the following in S3: year=2018 year=2019 but I would like to have this instead: year=2018 year=2018_$folder$ year=2019 year=2019_$folder$ The scripts that are reading from that S3 location depend on the *_$folder$ entries, but I haven't found a way to configure Spark/Hadoop to generate …
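
The *_$folder$ objects are just zero-byte keys that the old S3NativeFileSystem used as directory markers, so if Spark/EMRFS cannot be coaxed into emitting them, a hedged workaround is to create them after the write. Bucket and prefix below mirror the question's s3://xxxx/yyyy placeholders:

import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "xxxx", "yyyy/"  # placeholders from the question

for partition in ("year=2018", "year=2019"):
    # An empty object named "<dir>_$folder$" reproduces the marker format
    # that the downstream scripts expect.
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}{partition}_$folder$", Body=b"")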

EMR 5.21, Spark 2.4 - Json4s Dependency broken

Submitted by 删除回忆录丶 on 2021-02-09 20:57:35
Problem: In EMR 5.21, the Spark-HBase integration is broken: df.write.options().format().save() fails. The reason is json4s-jackson version 3.5.3 in Spark 2.4 / EMR 5.21; it works fine in EMR 5.11.2 with Spark 2.2 and json4s-jackson version 3.2.11. The problem is that this is EMR, so I can't rebuild Spark with a lower json4s version. Is there any workaround? Error: py4j.protocol.Py4JJavaError: An error occurred while calling o104.save. : java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z …
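
One workaround sometimes suggested for this kind of dependency clash is to place a compatible json4s-jackson jar ahead of the Spark-bundled one on the driver and executor classpaths. A sketch under stated assumptions: the jar path is hypothetical, and extraClassPath settings only take effect at JVM launch, so in practice they are passed as --conf flags to spark-submit rather than set after startup.

from pyspark.sql import SparkSession

JSON4S_JAR = "/home/hadoop/jars/json4s-jackson_2.11-3.2.11.jar"  # hypothetical

spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", JSON4S_JAR)
         .config("spark.executor.extraClassPath", JSON4S_JAR)
         .getOrCreate())

Whether this resolves the NoSuchMethodError depends on how the HBase connector was compiled, so it is a sketch to try, not a guaranteed fix.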

How to set Jupyter notebook to Python3 instead of Python2.7 in AWS EMR

Submitted by 倖福魔咒の on 2021-02-08 10:58:33
Problem: I am spinning up an EMR cluster in AWS. The difficulty arises when using Jupyter to import the associated Python modules. I have a shell script that executes when the EMR cluster starts and installs Python modules. The notebook is set to run using the PySpark kernel. I believe the problem is that the Jupyter notebook is not pointed at the correct Python on EMR, and the methods I have used to set the notebook to the correct version do not seem to work. I have set the following configurations and have tried changing …
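
The commonly cited way to do this is the spark-env configuration classification, which exports PYSPARK_PYTHON=/usr/bin/python3 for the cluster. A hedged sketch of supplying it through boto3 (the cluster name and the other required run_job_flow arguments are omitted and assumed to be filled in elsewhere):

# Configuration block pointing PySpark at Python 3 via the nested
# "export" classification under "spark-env".
SPARK_PYTHON3_CONFIG = [
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"},
            }
        ],
        "Properties": {},
    }
]

# Usage sketch (other parameters assumed):
# import boto3
# emr = boto3.client("emr")
# emr.run_job_flow(Name="my-cluster", Configurations=SPARK_PYTHON3_CONFIG, ...)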
