Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket

佛祖请我去吃肉 2020-12-31 18:25

It's been a couple of days, but I could not download from a public Amazon S3 bucket using Spark :(

Here is the spark-shell command:

spark-shell           


        
3 Answers
  • 2020-12-31 19:05

    Mmmm... I finally found the problem.

    The main issue is that my Spark is one of the pre-built-for-Hadoop distributions: 'v2.4.0 pre-built for Apache Hadoop 2.7 and later'. That title is a bit misleading, as my struggles above show; Spark actually ships with its own set of Hadoop jars rather than using the Hadoop version installed on the cluster. The listing of /usr/local/spark/jars/ shows that it has:

    hadoop-common-2.7.3.jar
    hadoop-client-2.7.3.jar
    ....

    but it is missing hadoop-aws and aws-java-sdk. A little digging in the Maven repository turned up hadoop-aws 2.7.3 and its dependency aws-java-sdk 1.7.4, and voilà! I downloaded those jars and passed them as parameters to Spark, like this:

    spark-shell \
    --master yarn \
    -v \
    --jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar \
    --driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar

    That did the job!

    I'm just wondering why all the Hadoop jars (I passed every one of them via --jars and --driver-class-path) didn't get picked up. Spark somehow automatically prefers its own bundled jars over the ones I pass in.
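
    Once the right jars are on the classpath, a quick way to verify s3a access from inside spark-shell is a sketch like the following (the bucket and key names are placeholders, not from my setup):

    // Inside spark-shell, with hadoop-aws and aws-java-sdk on the classpath.
    // For a private bucket, set credentials first:
    // sc.hadoopConfiguration.set("fs.s3a.access.key", "<your access key>")
    // sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your secret key>")
    val lines = sc.textFile("s3a://some-public-bucket/path/to/data.txt")
    lines.take(5).foreach(println)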

  • 2020-12-31 19:09

    I use Spark 2.4.5, and this is what I did; it worked for me, and I am able to connect to AWS S3 from Spark on my local machine.

    (1) Download Spark 2.4.5 from https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12.tgz. This build does not bundle Hadoop.
    (2) Download Hadoop from https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
    (3) Update .bash_profile:
    export SPARK_HOME=<SPARK_PATH>   # example: /home/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12
    export PATH=$SPARK_HOME/bin:$PATH
    (4) Add Hadoop to the Spark env:
    Copy spark-env.sh.template to spark-env.sh and add
    export SPARK_DIST_CLASSPATH=$(<hadoop_path> classpath)
    where <hadoop_path> is the path to your hadoop binary, e.g. /home/hadoop-3.2.1/bin/hadoop
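
    Once this is set up, a quick sanity check from spark-shell (just a sketch; the version printed should match the Hadoop install you pointed SPARK_DIST_CLASSPATH at) is:

    // Run inside spark-shell: confirms which Hadoop jars Spark actually loaded.
    // With the setup above it should print 3.2.1.
    import org.apache.hadoop.util.VersionInfo
    println(s"Hadoop on Spark's classpath: ${VersionInfo.getVersion}")
    println(s"Spark version: ${spark.version}")  // spark (SparkSession) is predefined in spark-shell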
    
  • 2020-12-31 19:24

    I advise you not to do what you did. You are running a pre-built Spark that bundles Hadoop 2.7.x jars on a Hadoop 2.9.2 cluster, and to work around the issue you added yet more jars to the classpath, the S3 ones from Hadoop 2.7.3.

    What you should be doing is working with a "Hadoop free" Spark build and providing Hadoop to it via configuration, as described here: https://spark.apache.org/docs/2.4.0/hadoop-provided.html

    The main parts:

    In conf/spark-env.sh:

    If the hadoop binary is on your PATH:

    export SPARK_DIST_CLASSPATH=$(hadoop classpath)

    With an explicit path to the hadoop binary:

    export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

    Passing a Hadoop configuration directory:

    export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
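
    One caveat to watch out for (my addition, not from the linked page): hadoop-aws and the AWS SDK live under share/hadoop/tools/lib and are not always part of the default hadoop classpath output, so it is worth checking from spark-shell that the s3a connector actually resolves. A small sketch:

    // Run inside spark-shell: checks whether the s3a connector made it onto
    // the classpath. hadoop-aws lives under share/hadoop/tools/lib and may
    // need to be added to the classpath explicitly.
    val hasS3a =
      try { Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem"); true }
      catch { case _: ClassNotFoundException => false }
    println(s"s3a connector on the classpath: $hasS3a")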
    