How To Get Local Spark on AWS to Write to S3

时光取名叫无心 2021-01-01 04:06

I have installed Spark 2.4.3 with Hadoop 3.2 on an AWS EC2 instance. I've been using Spark (mainly PySpark) in local mode with great success. It is nice to be able to spin u

1 Answer
  • 2021-01-01 04:53

    I helped @brettc with his configuration, and we worked out the correct settings.

    Add the following under $SPARK_HOME/conf/spark-defaults.conf:

    # Enable the S3A file system so that s3a:// paths are recognised
    spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
    
    # Parameters to use the new committers
    spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    spark.hadoop.fs.s3a.committer.name directory
    spark.hadoop.fs.s3a.committer.magic.enabled false
    spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
    spark.hadoop.fs.s3a.committer.staging.unique-filenames true
    spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads true
    spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
    spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
    spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
    

    If you look at the last two configuration lines above, you will see that you need the org.apache.spark.internal.io library, which contains the PathOutputCommitProtocol and BindingParquetOutputCommitter classes. To get them, download the spark-hadoop-cloud JAR (in our case we took version 2.3.2.3.1.0.6-1) and place it under $SPARK_HOME/jars/.
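    As a sketch, the same settings can be kept in one place as a Python dict and either emitted in spark-defaults.conf format (as below) or passed one by one to `SparkSession.builder.config(key, value)`. The dict literal here simply mirrors the configuration block above; nothing in it is new:

```python
# Sketch: the S3A committer settings above as a Python dict, so they can be
# generated into spark-defaults.conf or fed to SparkSession.builder.config().
committer_conf = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "false",
    "spark.hadoop.fs.s3a.committer.staging.conflict-mode": "replace",
    "spark.hadoop.fs.s3a.committer.staging.unique-filenames": "true",
    "spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads": "true",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}

# Emit the settings in spark-defaults.conf format ("key value" per line).
for key, value in committer_conf.items():
    print(f"{key} {value}")
```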

    You can easily verify that you are using the new committer by writing a Parquet file. The _SUCCESS file should contain JSON like the example below:

    {
      "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
      "timestamp" : 1574729145842,
      "date" : "Tue Nov 26 00:45:45 UTC 2019",
      "hostname" : "<hostname>",
      "committer" : "directory",
      "description" : "Task committer attempt_20191125234709_0000_m_000000_0",
      "metrics" : { [...] },
      "diagnostics" : { [...] },
      "filenames" : [...]
    }
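    The classic FileOutputCommitter leaves _SUCCESS as an empty marker file, while the S3A committers write the JSON shown above, so a small check of that file tells you which one ran. A minimal sketch (the `sample_success` payload and the helper name are illustrative, not part of any Spark API):

```python
# Sketch: read the "committer" field out of a _SUCCESS file's contents.
import json

# Illustrative payload mirroring the structure shown above.
sample_success = """
{
  "name": "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
  "committer": "directory"
}
"""

def committer_used(success_text: str) -> str:
    """Return the committer recorded in a _SUCCESS file, or 'unknown'."""
    try:
        data = json.loads(success_text)
    except ValueError:
        # Classic committers write an empty (non-JSON) _SUCCESS marker.
        return "unknown"
    return data.get("committer", "unknown")

print(committer_used(sample_success))  # → directory
```

    A value of "directory" confirms that the staging directory committer from the configuration above handled the write.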
    