Spark Redshift with Python

Anonymous (unverified), submitted on 2019-12-03 10:24:21

Question:

I'm trying to connect Spark with Amazon Redshift, but I'm getting this error:

My code is as follows:

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "bucket") \
    .load()

Answer 1:

Here is a step-by-step process for connecting to Redshift.

  • Download the Redshift JDBC driver; try the command below:
wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar" 
  • Save the code below in a Python file (the .py file you want to run) and replace the credentials accordingly.
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

# initialize the Spark session
spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECRET>")

taxonomyDf = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3://mybucket/") \
    .load()
  • Run spark-submit as shown below:
spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py 


Answer 2:

The error is due to missing dependencies.

Verify that you have these JAR files in the Spark home directory:

  1. spark-redshift_2.10-3.0.0-preview1.jar
  2. RedshiftJDBC41-1.1.10.1010.jar
  3. hadoop-aws-2.7.1.jar
  4. aws-java-sdk-1.7.4.jar
  5. aws-java-sdk-s3-1.11.60.jar (a newer version, but not everything worked with it)

Put these JAR files in $SPARK_HOME/jars/ and then start Spark:

pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar 

(SPARK_HOME should be "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)

This will run Spark with all the necessary dependencies. Note that you also need to specify the authentication option forward_spark_s3_credentials=True if you are using AWS access keys.

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option('forward_spark_s3_credentials', True) \
    .option("tempdir", "s3n://bucket") \
    .load()

Common errors afterwards are:

  • Redshift connection error: "SSL off"
    • Solution: append the SSL parameters to the JDBC URL, joined with & rather than a second ?: .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory") (see the sketch after this list)
  • S3 error: when unloading the data, e.g. after df.show(), you get the message: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
    • Solution: the bucket and the cluster must be in the same region.
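For illustration, here is a minimal sketch of the SSL fix from the first bullet, reusing the placeholder endpoint, database, and credentials from the question (none of them are real values):

# Hypothetical example: the question's read call with ssl=true and the
# non-validating SSL factory appended to the JDBC URL (joined with '&').
jdbc_url = (
    "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb"
    "?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory"
)

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbc_url) \
    .option("dbtable", "table_name") \
    .option("forward_spark_s3_credentials", True) \
    .option("tempdir", "s3n://bucket") \
    .load()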


Answer 3:

If you are using Databricks, I think you don't have to create a new SQLContext, because it is created for you; just use the existing sqlContext. Try this code:

from pyspark.sql import SQLContext

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

df = sqlContext.read \
    .......

Maybe the bucket is not mounted:

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME) 
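As a minimal sketch of what typically goes around that mount call (SECRET_KEY is a hypothetical variable holding the raw AWS secret key; MOUNT_NAME is the placeholder used above):

import urllib.parse

# The secret key is usually URL-encoded before being embedded in the s3a URI above.
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, safe="")

# After mounting, verify the mount by listing the bucket contents (Databricks utility).
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))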


Answer 4:

I think the s3n:// URL style has been deprecated and/or removed.

Try defining your keys as "fs.s3.awsAccessKeyId".
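A minimal sketch of that change, assuming the same SparkContext sc and the placeholder keys from the question:

# Set the credentials under the fs.s3 prefix instead of fs.s3n.
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEY>")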



Answer 5:

I think you need to add .format("com.databricks.spark.redshift") to your sql_context.read call; my hunch is that Spark can't infer the format for this data source, so you need to explicitly specify that the spark-redshift connector should be used.
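Applied to the code in the question, that would look roughly as follows (the tempdir is written as an S3 URI, as shown in Answer 2):

# The question's read call with the data source format specified explicitly.
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3n://bucket") \
    .load()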

For more detail on this error, see https://github.com/databricks/spark-redshift/issues/230


