Question:
I'm trying to connect Spark with Amazon Redshift, but I'm getting this error:
My code is as follows:
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "bucket") \
    .load()
Answer 1:
Here is a step-by-step process for connecting to Redshift.
- Download the Redshift JDBC connector jar; try the command below:
wget "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.2.1.1001.jar"
- Save the code below in a Python file (the .py you want to run) and replace the credentials accordingly.
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

# initialize the spark session
spark = SparkSession.builder.master("yarn").appName("Connect to redshift").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSKEYID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEYSECRET>")

taxonomyDf = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:postgresql://url.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3://mybucket/") \
    .load()
- Run spark-submit as shown below:
spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar test.py
Answer 2:
The error is due to missing dependencies.
Verify that you have these jar files in the spark home directory:
- spark-redshift_2.10-3.0.0-preview1.jar
- RedshiftJDBC41-1.1.10.1010.jar
- hadoop-aws-2.7.1.jar
- aws-java-sdk-1.7.4.jar
- aws-java-sdk-s3-1.11.60.jar (a newer version, but not everything worked with it)
Put these jar files in $SPARK_HOME/jars/ and then start Spark:
pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar
(On macOS with Homebrew, SPARK_HOME should be "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)
This will run Spark with all necessary dependencies. Note that you also need to set the authentication option 'forward_spark_s3_credentials' to True if you are using AWS access keys.
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("forward_spark_s3_credentials", True) \
    .option("tempdir", "s3n://bucket") \
    .load()
Common errors afterwards are:
- Redshift Connection Error: "SSL off"
- Solution:
.option("url", "jdbc:redshift://example.coyf2i236wts.eu-central- 1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
- S3 Error: When unloading the data, e.g. after df.show() you get the message: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
- Solution: The bucket and cluster must be in the same region (if moving the bucket is not an option, see the sketch below).
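If you cannot easily move the bucket, a minimal sketch of an alternative, assuming the s3a connector from hadoop-aws 2.7.x and a hypothetical bucket region, is to point the filesystem at the bucket's regional endpoint explicitly and use an s3a:// path for the tempdir:

# Assumption: hadoop-aws 2.7.x s3a connector; the endpoint below is a hypothetical example.
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
# ...and use an s3a:// tempdir in the reader, e.g. .option("tempdir", "s3a://bucket/temp/")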
Answer 3:
If you are using Databricks, I think you don't have to create a new SQLContext, because one is created for you; just use sqlContext. Try this code:
from pyspark.sql import SQLContext

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

df = sqlContext.read \
    .......
Maybe the bucket is not mounted; on Databricks you can mount it like this:
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
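For reference, a minimal sketch of how those mount variables might be prepared; the URL-encoding of the secret key and all names below are assumptions, not part of the original answer:

from urllib.parse import quote  # Python 3

ACCESS_KEY = "YOUR_KEY_ID"                        # hypothetical placeholder
SECRET_KEY = "YOUR_SECRET_ACCESS_KEY"             # hypothetical placeholder
ENCODED_SECRET_KEY = quote(SECRET_KEY, safe="")   # encode "/" etc. so the key is safe inside the URL
AWS_BUCKET_NAME = "my-bucket"                     # hypothetical placeholder
MOUNT_NAME = "my-mount"                           # hypothetical placeholder

# dbutils and spark are predefined in a Databricks notebook
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME),
                 "/mnt/%s" % MOUNT_NAME)
df = spark.read.text("/mnt/%s/some/file.txt" % MOUNT_NAME)   # read through the mount point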
Answer 4:
I think the s3n:// URL style has been deprecated and/or removed. Try defining your keys as "fs.s3.awsAccessKeyId" instead.
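A minimal sketch of what that might look like; the fs.s3 keys follow this answer's suggestion, while the fs.s3a variants are an assumption on my part, in case you switch to s3a:// paths:

# Suggested in this answer: use the fs.s3 prefix instead of fs.s3n
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "<ACCESSID>")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "<ACCESSKEY>")

# Assumption: if you use s3a:// URLs instead, the equivalent keys are
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<ACCESSID>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<ACCESSKEY>")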
Answer 5:
I think that you need to add .format("com.databricks.spark.redshift") to your sql_context.read call; my hunch is that Spark can't infer the format for this data source, so you need to explicitly specify that we should use the spark-redshift connector.
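For example, applied to the question's code (a minimal sketch; the host, database, and bucket below are placeholders):

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3n://bucket/temp/") \
    .load()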
For more detail on this error, see https://github.com/databricks/spark-redshift/issues/230