Spark Redshift with Python


Question


I'm trying to connect Spark with Amazon Redshift, but I'm getting an error.

My code is as follows:

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-    1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "bucket") \
    .load()

Answer 1:


Here is a step-by-step process for connecting to Redshift.

  • Download the Redshift JDBC driver; try the command below:
wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar"
  • Save the code below in a Python file (the .py file you want to run) and replace the credentials accordingly.
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

# Initialize the Spark session
spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

# S3 credentials for the temporary directory used by the Redshift connector
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECRET>")


taxonomyDf = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3://mybucket/") \
    .load() 
  • Run spark-submit as shown below:
spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py



Answer 2:


The error is due to missing dependencies.

Verify that you have these jar files in the Spark home directory:

  1. spark-redshift_2.10-3.0.0-preview1.jar
  2. RedshiftJDBC41-1.1.10.1010.jar
  3. hadoop-aws-2.7.1.jar
  4. aws-java-sdk-1.7.4.jar
  5. aws-java-sdk-s3-1.11.60.jar (a newer version, but not everything worked with it)

Put these jar files in $SPARK_HOME/jars/ and then start Spark:

pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar

(For a Homebrew install, SPARK_HOME should be "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)

This starts Spark with all the necessary dependencies. Note that you also need to set the authentication option 'forward_spark_s3_credentials' to True if you are using AWS access keys, as in the example below.

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<ACCESSID>")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<ACCESSKEY>")

df = sql_context.read \
     .format("com.databricks.spark.redshift") \
     .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-    1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
     .option("dbtable", "table_name") \
     .option('forward_spark_s3_credentials',True) \
     .option("tempdir", "s3n://bucket") \
     .load()

Common errors afterwards are:

  • Redshift connection error: "SSL off"
    • Solution: .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
  • S3 error: when unloading the data, e.g. after df.show(), you get the message "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
    • Solution: the bucket and the cluster must be in the same region (see the sketch after this list).
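
For the second error, one possible workaround, assuming the tempdir is accessed through s3a:// and the cluster sits in eu-central-1 (the endpoint value and bucket name below are illustrative, not from the original post):

# Point the S3A client at the regional endpoint of the tempdir bucket so requests are not
# rejected with "must be addressed using the specified endpoint"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

# ...and keep the bucket itself in the same region as the Redshift cluster
tempdir = "s3a://bucket-in-eu-central-1/tmp/"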



Answer 3:


If you are using Spark 2.0.4 and running your code on an AWS EMR cluster, follow the steps below:

1) Download the Redshift JDBC jar with the command below:

wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar

Reference: AWS documentation

2) Copy the code below into a Python file and replace the required values with your AWS resources:

import pyspark
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# S3 credentials for the temporary directory used by the Redshift connector
spark._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "access key")
spark._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "secret access key")

# SQLContext expects a SparkContext, not the SparkSession itself
sqlCon = SQLContext(spark.sparkContext)
df = sqlCon.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["ID", "TYPE", "CODE"])

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://HOST_URL:5439/DATABASE_NAME?user=USERID&password=PASSWORD") \
  .option("dbtable", "TABLE_NAME") \
  .option("aws_region", "us-west-1") \
  .option("tempdir", "s3://BUCKET_NAME/PATH/") \
  .mode("error") \
  .save()

3) Run the spark-submit command below:

spark-submit --name "App Name" --jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4 --py-files python_script.py python_script.py

Notes:

1) The public IP address of the EMR node (on which the spark-submit job will run) must be allowed in the inbound rules of the Redshift cluster's security group.

2) The Redshift cluster and the S3 location used as "tempdir" must be in the same region. In the example above, both resources are in us-west-1.

3) If the data is sensitive, make sure all channels are secured. To make the connections secure, follow the steps described here under configuration; a rough sketch of what that can involve follows.
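
A rough sketch of the two pieces, assuming SSL on the JDBC URL as in Answer 2 and S3 server-side encryption for the temporary files (the fs.s3a property name should be verified against your Hadoop version):

# Require SSL on the JDBC connection to Redshift (NonValidatingFactory skips certificate validation)
jdbc_url = ("jdbc:redshift://HOST_URL:5439/DATABASE_NAME"
            "?user=USERID&password=PASSWORD"
            "&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")

# Encrypt the temporary unload/copy files at rest in S3 (s3a setting; verify for your setup)
spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", "AES256")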




Answer 4:


If you are using Databricks, I think you don't have to create a new SQLContext, because Databricks creates one for you; just use the provided sqlContext. Try this code:

from pyspark.sql import SQLContext

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

df = sqlContext.read \ .......

Maybe the bucket is not mounted; you can mount it with:

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)



Answer 5:


I think the s3n:// URL style has been deprecated and/or removed.

Try defining your keys as "fs.s3.awsAccessKeyId".
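
For example, a minimal sketch; the property name depends on the filesystem scheme used in tempdir:

# s3://  -> fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey
# s3n:// -> fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey (legacy)
# s3a:// -> fs.s3a.access.key / fs.s3a.secret.key
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEY>")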




Answer 6:


I think that you need to add .format("com.databricks.spark.redshift") to your sql_context.read call; my hunch is that Spark can't infer the format for this data source, so you need to explicitly specify the spark-redshift connector.
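
Applied to the code in the question, that would look roughly like this (same placeholders as in the question; the tempdir is given an s3n:// scheme as in Answer 2):

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3n://bucket/") \
    .load()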

For more detail on this error, see https://github.com/databricks/spark-redshift/issues/230



Source: https://stackoverflow.com/questions/38307640/spark-redshift-with-python
