Spark Redshift with Python

梦如初夏 2021-01-03 10:39

I'm trying to connect Spark with Amazon Redshift but I'm getting this error:

My code is as follows:

from pyspark.sql import SQLContext


        
6 Answers
  •  余生分开走
    2021-01-03 11:35

    The error is due to missing dependencies.

    Verify that you have these JAR files in the Spark home directory:

    1. spark-redshift_2.10-3.0.0-preview1.jar
    2. RedshiftJDBC41-1.1.10.1010.jar
    3. hadoop-aws-2.7.1.jar
    4. aws-java-sdk-1.7.4.jar
    5. aws-java-sdk-s3-1.11.60.jar (a newer version of the SDK, but not everything worked with it)

    Put these JAR files in $SPARK_HOME/jars/ and then start Spark:

    pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar
    

    (On a Homebrew installation, SPARK_HOME is typically "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)
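
    As an alternative to passing --jars on the command line, you can attach the same JARs from Python through the spark.jars property when the SparkContext is created. This is only a minimal sketch (not from the original answer); it assumes SPARK_HOME is set and that the JARs listed above sit in $SPARK_HOME/jars/:

    import os
    from pyspark import SparkConf, SparkContext

    # Comma-separated list of the JAR files listed above (assumed to live in $SPARK_HOME/jars/).
    spark_home = os.environ["SPARK_HOME"]
    jars = ",".join(
        os.path.join(spark_home, "jars", name)
        for name in [
            "spark-redshift_2.10-3.0.0-preview1.jar",
            "RedshiftJDBC41-1.1.10.1010.jar",
            "hadoop-aws-2.7.1.jar",
            "aws-java-sdk-1.7.4.jar",
        ]
    )

    # spark.jars must be set before the SparkContext is created.
    conf = SparkConf().setAppName("Connect Spark with Redshift").set("spark.jars", jars)
    sc = SparkContext(conf=conf)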

    Either way, Spark starts with all the necessary dependencies on the classpath. Note that you also need to set the option 'forward_spark_s3_credentials' to True if you authenticate with AWS access keys, as in the snippet below.

    from pyspark.sql import SQLContext
    from pyspark import SparkContext

    sc = SparkContext(appName="Connect Spark with Redshift")
    sql_context = SQLContext(sc)

    # Fill in your own AWS access keys (left blank in the original post).
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<AWS_ACCESS_KEY_ID>")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<AWS_SECRET_ACCESS_KEY>")

    # Read a Redshift table into a DataFrame; data is staged through the S3 tempdir.
    df = sql_context.read \
         .format("com.databricks.spark.redshift") \
         .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
         .option("dbtable", "table_name") \
         .option("forward_spark_s3_credentials", True) \
         .option("tempdir", "s3n://bucket") \
         .load()
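
    For completeness: writing a DataFrame back to Redshift goes through the same S3 tempdir. A minimal sketch reusing the options from above (the target table name is a placeholder):

    # Write the DataFrame back to Redshift via the S3 tempdir (placeholder table name).
    df.write \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
        .option("dbtable", "table_name_copy") \
        .option("tempdir", "s3n://bucket") \
        .option("forward_spark_s3_credentials", True) \
        .mode("append") \
        .save()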
    

    Common errors afterwards are:

    • Redshift Connection Error: "SSL off"
      • Solution: append the SSL parameters to the JDBC URL, e.g. .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory") (shown in context in the sketch after this list)
    • S3 Error: when unloading the data, e.g. after df.show(), you get the message: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
      • Solution: the S3 bucket and the Redshift cluster must be in the same region
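
    To show the SSL fix in context, here is the read from above with the SSL parameters appended to the JDBC URL. A sketch only, with placeholder cluster endpoint, database, and credentials:

    # Same read as above, with ssl parameters appended to the JDBC URL (placeholder values).
    jdbc_url = (
        "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb"
        "?user=user&password=pwd"
        "&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory"
    )

    df = sql_context.read \
        .format("com.databricks.spark.redshift") \
        .option("url", jdbc_url) \
        .option("dbtable", "table_name") \
        .option("forward_spark_s3_credentials", True) \
        .option("tempdir", "s3n://bucket") \
        .load()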
