Spark Redshift with Python

Anonymous (unverified), submitted on 2019-12-03 10:24:21

Question:

I'm trying to connect Spark with Amazon Redshift, but I'm getting this error:

My code is as follows:

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "bucket") \
    .load()

Answer 1:

Here is a step-by-step process for connecting to Redshift.

  • Download the Redshift JDBC driver; try the command below:
wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar" 
  • Save the code below in a Python file (the .py file you want to run) and replace the credentials accordingly.
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

# initialize the Spark session
spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECRET>")

taxonomyDf = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3://mybucket/") \
    .load()
  • Run spark-submit as shown below:
spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py 


Answer 2:

The error is due to missing dependencies.

Verify that you have these JAR files in the Spark home directory:

  1. spark-redshift_2.10-3.0.0-preview1.jar
  2. RedshiftJDBC41-1.1.10.1010.jar
  3. hadoop-aws-2.7.1.jar
  4. aws-java-sdk-1.7.4.jar
  5. aws-java-sdk-s3-1.11.60.jar (a newer version, but not everything worked with it)

Put these JAR files in $SPARK_HOME/jars/ and then start Spark:

pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar 

(SPARK_HOME should be "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)

This will run Spark with all the necessary dependencies. Note that you also need to specify the authentication option forward_spark_s3_credentials=True if you are using AWS access keys.

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option('forward_spark_s3_credentials', True) \
    .option("tempdir", "s3n://bucket") \
    .load()

Common errors afterwards are:

  • Redshift connection error: "SSL off"
    • Solution: append the SSL parameters to the JDBC URL, joined with & rather than a second ?: .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory") (see the sketch after this list)
  • S3 error: when unloading the data, e.g. after df.show(), you get the message: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
    • Solution: the bucket and the cluster must be in the same region.
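For illustration, here is a minimal sketch of the SSL fix from the first bullet, reusing the placeholder endpoint, database, and credentials from the question (none of them are real values):

# Hypothetical example: the question's read call with ssl=true and the
# non-validating SSL factory appended to the JDBC URL (joined with '&').
jdbc_url = (
    "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb"
    "?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory"
)

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbc_url) \
    .option("dbtable", "table_name") \
    .option("forward_spark_s3_credentials", True) \
    .option("tempdir", "s3n://bucket") \
    .load()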


Answer 3:

If you are using Databricks, I think you don't have to create a new SQLContext, because it is created for you; just use the existing sqlContext. Try this code:

from pyspark.sql import SQLContext

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

df = sqlContext.read \
    .......

Maybe the bucket is not mounted:

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME) 
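As a minimal sketch of what typically goes around that mount call (SECRET_KEY is a hypothetical variable holding the raw AWS secret key; MOUNT_NAME is the placeholder used above):

import urllib.parse

# The secret key is usually URL-encoded before being embedded in the s3a URI above.
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, safe="")

# After mounting, verify the mount by listing the bucket contents (Databricks utility).
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))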


Answer 4:

I think the s3n:// URL style has been deprecated and/or removed.

Try defining your keys as "fs.s3.awsAccessKeyId".
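A minimal sketch of that change, assuming the same SparkContext sc and the placeholder keys from the question:

# Set the credentials under the fs.s3 prefix instead of fs.s3n.
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEY>")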



Answer 5:

I think you need to add .format("com.databricks.spark.redshift") to your sql_context.read call; my hunch is that Spark can't infer the format for this data source, so you need to explicitly specify that the spark-redshift connector should be used.
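Applied to the code in the question, that would look roughly as follows (the tempdir is written as an S3 URI, as shown in Answer 2):

# The question's read call with the data source format specified explicitly.
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3n://bucket") \
    .load()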

For more detail on this error, see https://github.com/databricks/spark-redshift/issues/230


