Question:
I haven't been able to figure this out, but I'm trying to use a direct output committer with AWS Glue:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
Is it possible to use this configuration with AWS Glue?
Answer 1:
Option 1:
Glue runs on a Spark context, so you can set Hadoop configuration for AWS Glue the same way you would in plain Spark; internally a DynamicFrame is essentially a kind of DataFrame.
sc._jsc.hadoopConfiguration().set("mykey","myvalue")
I think you also need to add the corresponding committer class, like this:
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")
Example snippet:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")
glueContext = GlueContext(sc)
spark = glueContext.spark_session
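With the GlueContext created this way, any subsequent S3 write goes through the version-2 committer. A minimal usage sketch (the output path is a hypothetical placeholder):
# Any DataFrame (or DynamicFrame) written after this point uses committer algorithm v2
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.write.mode("overwrite").parquet("s3://my-bucket/output/")  # placeholder path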
To prove that the configuration exists:
Debug in Python:
print(sc._conf.getAll())
Debug in Scala:
sc.getConf.getAll.foreach(println)
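Note that sc._conf.getAll() lists Spark properties; to check the Hadoop-level key directly, you can read it back through the same Hadoop configuration object (a minimal sketch using the standard Configuration getter):
# Returns "2" once the setting has been applied
sc._jsc.hadoopConfiguration().get("mapreduce.fileoutputcommitter.algorithm.version")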
Option 2:
Alternatively, you can try using Glue's job parameters:
https://docs.aws.amazon.com/glue/latest/dg/add-job.html, which take key/value properties as mentioned in the docs, e.g.
'--myKey' : 'value-for-myKey'
To do this, edit the job in the console and specify the parameter with the --conf key.
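Note that --conf is consumed by Glue itself to configure Spark; custom keys like the '--myKey' example above are read inside the script with getResolvedOptions. A minimal sketch ('myKey' is the hypothetical parameter name from the docs example):
import sys
from awsglue.utils import getResolvedOptions

# Resolves job parameters passed as '--myKey': 'value-for-myKey'
args = getResolvedOptions(sys.argv, ['myKey'])
print(args['myKey'])  # prints "value-for-myKey"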
Option 3:
If you are using the AWS CLI, you can try the approach below:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
Amusingly, the docs include a "do not set" message for this parameter, yet it is exposed anyway; I don't know why.
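For completeness, a hedged sketch of setting the same argument programmatically, here via boto3 rather than the raw CLI; the job name, IAM role, and script location are hypothetical placeholders:
import boto3

glue = boto3.client('glue')

# DefaultArguments mirror the console's job parameters; '--conf' is the key
# the docs warn against setting, but it carries the Spark conf string here
glue.create_job(
    Name='my-glue-job',                  # hypothetical job name
    Role='MyGlueServiceRole',            # hypothetical IAM role
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/job.py',  # placeholder
    },
    DefaultArguments={
        '--conf': 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2',
    },
)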
To sum up: I personally prefer Option 1, since it gives you programmatic control.
Answer 2:
Go to the Glue job console and edit your job as follows:
Glue > Jobs > Edit your Job > Script libraries and job parameters (optional) > Job parameters
Set the following:
key: --conf
value: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
Source: https://stackoverflow.com/questions/56432696/use-spark-fileoutputcommitter-algorithm-version-2-with-aws-glue