Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket

佛祖请我去吃肉 2020-12-31 18:25

It's been a couple of days, but I could not download from a public Amazon S3 bucket using Spark :(

Here is the spark-shell command:

spark-shell           


        
3 Answers
  • 2020-12-31 19:05

    Mmmm... I finally found the problem.

    The main issue is that my Spark is one of the pre-built-for-Hadoop distributions: 'v2.4.0 pre-built for Apache Hadoop 2.7 and later'. That title is a bit misleading, as my struggles above show; Spark actually ships with its own set of Hadoop jars rather than using the Hadoop version installed on the cluster. The listing of /usr/local/spark/jars/ shows that it has:

    hadoop-common-2.7.3.jar
    hadoop-client-2.7.3.jar
    ....

    but it is missing hadoop-aws and aws-java-sdk. A little digging in the Maven repository turned up hadoop-aws 2.7.3 and its dependency aws-java-sdk 1.7.4, and voilà! I downloaded those jars and passed them as parameters to Spark, like this:

    spark-shell \
    --master yarn \
    -v \
    --jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar \
    --driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar

    That did the job!

    I'm just wondering why all the Hadoop jars (I passed every one of them via --jars and --driver-class-path) didn't get picked up. Spark somehow automatically prefers its own bundled jars over the ones I pass in.
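
    Once the right jars are on the classpath, a quick way to verify s3a access from inside spark-shell is a sketch like the following (the bucket and key names are placeholders, not from my setup):

    // Inside spark-shell, with hadoop-aws and aws-java-sdk on the classpath.
    // For a private bucket, set credentials first:
    // sc.hadoopConfiguration.set("fs.s3a.access.key", "<your access key>")
    // sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your secret key>")
    val lines = sc.textFile("s3a://some-public-bucket/path/to/data.txt")
    lines.take(5).foreach(println)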

  • 2020-12-31 19:09

    I use Spark 2.4.5, and this is what I did; it worked for me, and I am able to connect to AWS S3 from Spark on my local machine.

    (1) Download Spark 2.4.5 from https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12.tgz. This build does not bundle Hadoop.
    (2) Download Hadoop from https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
    (3) Update .bash_profile:
    export SPARK_HOME=<SPARK_PATH>   # example: /home/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12
    export PATH=$SPARK_HOME/bin:$PATH
    (4) Add Hadoop to the Spark env:
    Copy spark-env.sh.template to spark-env.sh and add
    export SPARK_DIST_CLASSPATH=$(<hadoop_path> classpath)
    where <hadoop_path> is the path to your hadoop binary, e.g. /home/hadoop-3.2.1/bin/hadoop
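
    Once this is set up, a quick sanity check from spark-shell (just a sketch; the version printed should match the Hadoop install you pointed SPARK_DIST_CLASSPATH at) is:

    // Run inside spark-shell: confirms which Hadoop jars Spark actually loaded.
    // With the setup above it should print 3.2.1.
    import org.apache.hadoop.util.VersionInfo
    println(s"Hadoop on Spark's classpath: ${VersionInfo.getVersion}")
    println(s"Spark version: ${spark.version}")  // spark (SparkSession) is predefined in spark-shell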
    
  • 2020-12-31 19:24

    I advise you not to do what you did. You are running a pre-built Spark that bundles Hadoop 2.7.x jars on a Hadoop 2.9.2 cluster, and to work around the issue you added yet more jars to the classpath, the S3 ones from Hadoop 2.7.3.

    What you should be doing is working with a "Hadoop free" Spark build and providing Hadoop to it via configuration, as described here: https://spark.apache.org/docs/2.4.0/hadoop-provided.html

    The main parts:

    In conf/spark-env.sh:

    If the hadoop binary is on your PATH:

    export SPARK_DIST_CLASSPATH=$(hadoop classpath)

    With an explicit path to the hadoop binary:

    export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

    Passing a Hadoop configuration directory:

    export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
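
    One caveat to watch out for (my addition, not from the linked page): hadoop-aws and the AWS SDK live under share/hadoop/tools/lib and are not always part of the default hadoop classpath output, so it is worth checking from spark-shell that the s3a connector actually resolves. A small sketch:

    // Run inside spark-shell: checks whether the s3a connector made it onto
    // the classpath. hadoop-aws lives under share/hadoop/tools/lib and may
    // need to be added to the classpath explicitly.
    val hasS3a =
      try { Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem"); true }
      catch { case _: ClassNotFoundException => false }
    println(s"s3a connector on the classpath: $hasS3a")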
    