Question:
I haven't been able to figure this out, but I'm trying to use a direct output committer with AWS Glue:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
Is it possible to use this configuration with AWS Glue?
Answer 1:
Option 1:
Glue runs on a Spark context, so you can set Hadoop configuration for AWS Glue the same way you would in plain Spark; internally a DynamicFrame is essentially a kind of DataFrame.
sc._jsc.hadoopConfiguration().set("mykey","myvalue")
I think you also need to add the corresponding committer class, like this:
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")
Example snippet:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")
glueContext = GlueContext(sc)
spark = glueContext.spark_session
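With the GlueContext created this way, any subsequent S3 write goes through the version-2 committer. A minimal usage sketch (the output path is a hypothetical placeholder):
# Any DataFrame (or DynamicFrame) written after this point uses committer algorithm v2
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.write.mode("overwrite").parquet("s3://my-bucket/output/")  # placeholder path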
To prove that the configuration exists:
Debug in Python:
print(sc._conf.getAll())
Debug in Scala:
sc.getConf.getAll.foreach(println)
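Note that sc._conf.getAll() lists Spark properties; to check the Hadoop-level key directly, you can read it back through the same Hadoop configuration object (a minimal sketch using the standard Configuration getter):
# Returns "2" once the setting has been applied
sc._jsc.hadoopConfiguration().get("mapreduce.fileoutputcommitter.algorithm.version")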
Option 2:
Alternatively, you can try using Glue's job parameters:
https://docs.aws.amazon.com/glue/latest/dg/add-job.html, which take key/value properties as mentioned in the docs, e.g.
'--myKey' : 'value-for-myKey'
To do this, edit the job in the console and specify the parameter with the --conf key.
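Note that --conf is consumed by Glue itself to configure Spark; custom keys like the '--myKey' example above are read inside the script with getResolvedOptions. A minimal sketch ('myKey' is the hypothetical parameter name from the docs example):
import sys
from awsglue.utils import getResolvedOptions

# Resolves job parameters passed as '--myKey': 'value-for-myKey'
args = getResolvedOptions(sys.argv, ['myKey'])
print(args['myKey'])  # prints "value-for-myKey"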
Option 3:
If you are using the AWS CLI, you can try the approach below:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
Amusingly, the docs include a "do not set" message for this parameter, yet it is exposed anyway; I don't know why.
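For completeness, a hedged sketch of setting the same argument programmatically, here via boto3 rather than the raw CLI; the job name, IAM role, and script location are hypothetical placeholders:
import boto3

glue = boto3.client('glue')

# DefaultArguments mirror the console's job parameters; '--conf' is the key
# the docs warn against setting, but it carries the Spark conf string here
glue.create_job(
    Name='my-glue-job',                  # hypothetical job name
    Role='MyGlueServiceRole',            # hypothetical IAM role
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/job.py',  # placeholder
    },
    DefaultArguments={
        '--conf': 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2',
    },
)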
To sum up: I personally prefer Option 1, since it gives you programmatic control.
Answer 2:
Go to the Glue job console and edit your job as follows:
Glue > Jobs > Edit your Job > Script libraries and job parameters (optional) > Job parameters
Set the following:
key: --conf
value: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
Source: https://stackoverflow.com/questions/56432696/use-spark-fileoutputcommitter-algorithm-version-2-with-aws-glue