Glue Job fails to write file

Submitted by 烈酒焚心 on 2019-12-13 03:39:01

Question


I am backfilling some data via Glue jobs. The job itself reads a TSV from S3, transforms the data slightly, and writes it to S3 as Parquet. Since I already have the data, I am launching multiple jobs at once to reduce the time needed to process it all. When I launch multiple jobs at the same time, one of them sometimes fails to output the resultant Parquet files to S3, yet the job itself completes successfully without throwing an error. When I rerun the job as a non-parallel task, the files are output correctly. Is there some issue, either with Glue (or the underlying Spark) or with S3, that would cause this?
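For reference, a minimal sketch of a job shaped like the one described, assuming a PySpark Glue job; the bucket paths and transformation-context names are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the TSV from S3 (path is a placeholder)
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"separator": "\t", "withHeader": True},
    transformation_ctx="read_tsv",
)

# ... light transformations on `source` would go here ...

# Write the result back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    transformation_ctx="write_parquet",
)

job.commit()
```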


Answer 1:


The same Glue job running in parallel may produce files with the same names, so some of them can be overwritten. If I remember correctly, the transformation context (the `transformation_ctx` argument) is used as part of the file name. I assume you don't have job bookmarking enabled, so it should be safe for you to generate the transformation-context value dynamically to ensure it is unique for each job run, as sketched below.
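A minimal sketch of that idea, assuming the PySpark job from the question; `transformed` stands for the DynamicFrame produced by the earlier steps, and the output path is a placeholder:

```python
import uuid

# Per-run suffix so parallel runs of the same job do not share a
# transformation context (and thus do not collide on output names).
# NOTE: only do this with job bookmarks disabled, because the
# transformation context is also the key Glue uses for bookmarks.
run_suffix = uuid.uuid4().hex

glue_context.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    transformation_ctx=f"write_parquet_{run_suffix}",
)
```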



Source: https://stackoverflow.com/questions/57061213/glue-job-fails-to-write-file
