Hadoop 2.9.2, Spark 2.4.0: accessing an AWS s3a bucket

佛祖请我去吃肉 2020-12-31 18:25

It's been a couple of days, but I could not download from a public Amazon bucket using Spark :(

Here is the spark-shell command:

spark-shell           
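
For context, a typical spark-shell launch for s3a access pulls the hadoop-aws connector from Maven and then reads via an s3a:// URL. The sketch below is illustrative only, not the command from this post; the connector version, bucket name, and credential values are placeholder assumptions:

    spark-shell \
      --master yarn \
      --packages org.apache.hadoop:hadoop-aws:2.7.3

    scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")    // placeholder
    scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")    // placeholder
    scala> spark.read.textFile("s3a://some-public-bucket/some-object.txt").count()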


        
3 Answers
  •  清歌不尽 2020-12-31 19:05

    Mmmm... I finally found the problem.

    The main issue is that my Spark is pre-built for Hadoop: it is "v2.4.0 pre-built for Hadoop 2.7 and later". That title is a bit misleading, as you can see from my struggles above: Spark actually ships with its own set of Hadoop jars. Listing /usr/local/spark/jars/ shows that it has:

    hadoop-common-2.7.3.jar
    hadoop-client-2.7.3.jar
    ....

    It is only missing hadoop-aws and aws-java-sdk. After a little digging in the Maven repository I found hadoop-aws-2.7.3 and its dependency aws-java-sdk-1.7.4, and voilà! I downloaded those jars and passed them to Spark as parameters, like this:

    spark-shell \
      --master yarn \
      -v \
      --jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar \
      --driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar

    That did the job!
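
    For anyone checking the same setup, the s3a binding can be verified straight from the shell with the Hadoop FileSystem API. This is a sketch; the bucket name is a placeholder:

        scala> val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("s3a://some-bucket"), sc.hadoopConfiguration)
        scala> fs.listStatus(new org.apache.hadoop.fs.Path("/")).foreach(s => println(s.getPath))

    As an alternative to downloading the jars by hand, launching with --packages org.apache.hadoop:hadoop-aws:2.7.3 resolves the connector and its aws-java-sdk-1.7.4 dependency from Maven automatically.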

    I'm just wondering why none of the jars from my Hadoop installation were picked up, even though I passed all of them via --jars and --driver-class-path. Spark somehow automatically prefers its own jars over the ones I send.
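
    A likely explanation, though not confirmed in this thread: everything under Spark's own jars/ directory goes on the JVM classpath ahead of jars supplied with --jars, so the bundled Hadoop 2.7.3 classes shadow whatever else is sent along. Spark has experimental switches to prefer user jars, sketched below; note that spark.driver.userClassPathFirst only applies in cluster mode, and both flags can introduce conflicts of their own, so matching the bundled versions as above is usually the safer fix:

        spark-shell \
          --master yarn \
          --conf spark.driver.userClassPathFirst=true \
          --conf spark.executor.userClassPathFirst=true \
          --jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar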
