Using AWS Glue to convert very big csv.gz (30-40 gb each) to parquet


Question


There are lots of questions like this, but nothing seems to help. I am trying to convert quite large csv.gz files to parquet and keep getting various errors like

'Command failed with exit code 1'

or

An error occurred while calling o392.pyWriteDynamicFrame. Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-5-241.eu-central-1.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container marked as failed

In the metrics monitoring I don't see much CPU or memory load. There is ETL data movement, but that shouldn't trigger any errors when working with S3.

Another problem is that such a job runs for 4-5 hours before failing. Is this expected behavior? The CSV files have around 30-40 columns.

I don't know which direction to go. Can Glue handle files this large at all?
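For reference, a Glue job of the kind described here (inferred from the pyWriteDynamicFrame call in the stack trace, not taken from the asker's actual script) typically looks roughly like the sketch below; the S3 paths are placeholders.

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the gzipped CSV files straight from S3 (path is a placeholder).
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input-csv-gz/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the same data back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output-parquet/"},
    format="parquet",
)

job.commit()
```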


Answer 1:


I think the problem isn't directly connected to the number of DPUs. You have a large file and you are using the GZIP format, which is not splittable, and that is why you have this problem.

I suggest converting your files from GZIP to bzip2 or lz4. Additionally, you should consider partitioning the output data for better performance in the future. A sketch of this follows the link below.

http://comphadoop.weebly.com/
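A minimal sketch of that suggestion, assuming the input has been re-compressed with a splittable codec (e.g. .bz2) and that plain Spark DataFrames are acceptable inside the Glue job; the paths and the partition column ("year") are placeholders.

```python
from pyspark.sql import SparkSession

# Inside a Glue job you would normally reuse glueContext.spark_session instead.
spark = SparkSession.builder.getOrCreate()

# With a splittable codec (.bz2), Spark can split the read across many tasks
# instead of forcing a single executor to decompress a 30-40 GB file on its own.
df = spark.read.csv("s3://my-bucket/input-csv-bz2/", header=True, inferSchema=True)

(
    df.repartition(200)          # spread the work over many tasks/output files
      .write
      .mode("overwrite")
      .partitionBy("year")       # partition the output for faster reads later
      .parquet("s3://my-bucket/output-parquet/")
)
```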




Answer 2:


How many DPUs are you using? This article gives a nice overview of DPU capacity planning. Hope that helps. There is no definite rulebook from AWS stating how many DPUs you need to process a particular data size.
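If you do decide to scale the job up, a hedged boto3 sketch of how capacity is set when starting a job run (the job name and numbers are placeholders, not a sizing recommendation):

```python
import boto3

glue = boto3.client("glue")

# Newer Glue Spark jobs use WorkerType/NumberOfWorkers; older ones use MaxCapacity (DPUs).
response = glue.start_job_run(
    JobName="csv-gz-to-parquet",   # placeholder job name
    WorkerType="G.2X",             # more memory per worker than G.1X
    NumberOfWorkers=10,
)
print(response["JobRunId"])
```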



Source: https://stackoverflow.com/questions/52614265/using-aws-glue-to-convert-very-big-csv-gz-30-40-gb-each-to-parquet
