Spark Redshift with Python

梦如初夏 · 2021-01-03 10:39

I'm trying to connect Spark with Amazon Redshift, but I'm getting this error:

My code is as follows:

from pyspark.sql import SQLContext


        
6 Answers
  •  不知归路
    2021-01-03 11:33

    If you are using Spark 2.0.4 and running your code on an AWS EMR cluster, follow the steps below:

    1) Download the Redshift JDBC jar with the following command:

    wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar
    

    Reference: AWS documentation

    2) Copy the code below into a Python file, then replace the placeholder values with your own AWS resources:

    from pyspark.sql import SparkSession

    # Create (or reuse) the Spark session
    spark = SparkSession.builder.getOrCreate()

    # Credentials the connector uses to stage data in the S3 "tempdir"
    spark._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "access key")
    spark._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "secret access key")

    # Sample data to load into Redshift
    df = spark.createDataFrame([
        (1, "A", "X1"),
        (2, "B", "X2"),
        (3, "B", "X3"),
        (1, "B", "X3"),
        (2, "C", "X2"),
        (3, "C", "X2"),
        (1, "C", "X1"),
        (1, "B", "X1"),
    ], ["ID", "TYPE", "CODE"])
    
    df.write \
      .format("com.databricks.spark.redshift") \
      .option("url", "jdbc:redshift://HOST_URL:5439/DATABASE_NAME?user=USERID&password=PASSWORD") \
      .option("dbtable", "TABLE_NAME") \
      .option("aws_region", "us-west-1") \
      .option("tempdir", "s3://BUCKET_NAME/PATH/") \
      .mode("error") \
      .save()
    

    3) Run the following spark-submit command:

    spark-submit --name "App Name" \
      --jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar \
      --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4 \
      python_script.py
    
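    Once the job finishes, a quick way to verify the load is to read the table back with the same connector. Below is a minimal sketch reusing the placeholder values from the write example:

    df_check = spark.read \
      .format("com.databricks.spark.redshift") \
      .option("url", "jdbc:redshift://HOST_URL:5439/DATABASE_NAME?user=USERID&password=PASSWORD") \
      .option("dbtable", "TABLE_NAME") \
      .option("tempdir", "s3://BUCKET_NAME/PATH/") \
      .load()

    df_check.show()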

    Notes:

    1) The public IP address of the EMR node (the one on which the spark-submit job runs) must be allowed in the inbound rules of the Redshift cluster's security group.

    2) The Redshift cluster and the S3 location used as "tempdir" must be in the same AWS region. In the example above, both resources are in us-west-1.

    3) If the data is sensitive, make sure all channels are secured. To secure the connections, follow the steps described under Configuration; a sketch of one such measure follows this list.
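
    As a sketch for Note 3 (an illustration, not the full hardening the configuration steps describe): the Redshift JDBC driver accepts SSL parameters directly in the connection URL, so one first step is to force an encrypted connection with ssl=true. The NonValidatingFactory used below skips server-certificate validation and is shown only to keep the example short; validate the certificate in production.

    # Same write as above, but over an SSL-encrypted JDBC connection.
    # NonValidatingFactory skips certificate checks -- for brevity only.
    jdbc_url = (
        "jdbc:redshift://HOST_URL:5439/DATABASE_NAME"
        "?user=USERID&password=PASSWORD"
        "&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory"
    )

    df.write \
      .format("com.databricks.spark.redshift") \
      .option("url", jdbc_url) \
      .option("dbtable", "TABLE_NAME") \
      .option("aws_region", "us-west-1") \
      .option("tempdir", "s3://BUCKET_NAME/PATH/") \
      .mode("error") \
      .save()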
