Spark Redshift with Python

梦如初夏 2021-01-03 10:39

I'm trying to connect Spark with Amazon Redshift but I'm getting this error:

My code is as follows:

from pyspark.sql import SQLContext


        
6 Answers
  •  余生分开走
    2021-01-03 11:35

    The error is due to missing dependencies.

    Verify that you have these JAR files in the Spark home directory:

    1. spark-redshift_2.10-3.0.0-preview1.jar
    2. RedshiftJDBC41-1.1.10.1010.jar
    3. hadoop-aws-2.7.1.jar
    4. aws-java-sdk-1.7.4.jar
    5. aws-java-sdk-s3-1.11.60.jar (a newer version of the SDK, but not everything worked with it)

    Put these JAR files in $SPARK_HOME/jars/ and then start Spark:

    pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar
    

    (On a Homebrew installation, SPARK_HOME is typically "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec".)
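
    As an alternative to passing --jars on the command line, you can attach the same JARs from Python through the spark.jars property when the SparkContext is created. This is only a minimal sketch (not from the original answer); it assumes SPARK_HOME is set and that the JARs listed above sit in $SPARK_HOME/jars/:

    import os
    from pyspark import SparkConf, SparkContext

    # Comma-separated list of the JAR files listed above (assumed to live in $SPARK_HOME/jars/).
    spark_home = os.environ["SPARK_HOME"]
    jars = ",".join(
        os.path.join(spark_home, "jars", name)
        for name in [
            "spark-redshift_2.10-3.0.0-preview1.jar",
            "RedshiftJDBC41-1.1.10.1010.jar",
            "hadoop-aws-2.7.1.jar",
            "aws-java-sdk-1.7.4.jar",
        ]
    )

    # spark.jars must be set before the SparkContext is created.
    conf = SparkConf().setAppName("Connect Spark with Redshift").set("spark.jars", jars)
    sc = SparkContext(conf=conf)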

    Either way, Spark starts with all the necessary dependencies on the classpath. Note that you also need to set the option 'forward_spark_s3_credentials' to True if you authenticate with AWS access keys, as in the snippet below.

    from pyspark.sql import SQLContext
    from pyspark import SparkContext

    sc = SparkContext(appName="Connect Spark with Redshift")
    sql_context = SQLContext(sc)

    # Fill in your own AWS access keys (left blank in the original post).
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<AWS_ACCESS_KEY_ID>")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<AWS_SECRET_ACCESS_KEY>")

    # Read a Redshift table into a DataFrame; data is staged through the S3 tempdir.
    df = sql_context.read \
         .format("com.databricks.spark.redshift") \
         .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
         .option("dbtable", "table_name") \
         .option("forward_spark_s3_credentials", True) \
         .option("tempdir", "s3n://bucket") \
         .load()
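
    For completeness: writing a DataFrame back to Redshift goes through the same S3 tempdir. A minimal sketch reusing the options from above (the target table name is a placeholder):

    # Write the DataFrame back to Redshift via the S3 tempdir (placeholder table name).
    df.write \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
        .option("dbtable", "table_name_copy") \
        .option("tempdir", "s3n://bucket") \
        .option("forward_spark_s3_credentials", True) \
        .mode("append") \
        .save()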
    

    Common errors afterwards are:

    • Redshift Connection Error: "SSL off"
      • Solution: append the SSL parameters to the JDBC URL, e.g. .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory") (shown in context in the sketch after this list)
    • S3 Error: when unloading the data, e.g. after df.show(), you get the message: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
      • Solution: the S3 bucket and the Redshift cluster must be in the same region
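
    To show the SSL fix in context, here is the read from above with the SSL parameters appended to the JDBC URL. A sketch only, with placeholder cluster endpoint, database, and credentials:

    # Same read as above, with ssl parameters appended to the JDBC URL (placeholder values).
    jdbc_url = (
        "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb"
        "?user=user&password=pwd"
        "&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory"
    )

    df = sql_context.read \
        .format("com.databricks.spark.redshift") \
        .option("url", jdbc_url) \
        .option("dbtable", "table_name") \
        .option("forward_spark_s3_credentials", True) \
        .option("tempdir", "s3n://bucket") \
        .load()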
