Using AWS Glue to convert very big csv.gz (30-40 gb each) to parquet


Question


There are lots of questions like this, but nothing seems to help. I am trying to convert quite large csv.gz files to parquet and keep getting various errors like

'Command failed with exit code 1'

or

An error occurred while calling o392.pyWriteDynamicFrame. Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-5-241.eu-central-1.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container marked as failed

In the metrics monitoring I don't see much CPU or memory load. There is ETL data movement, but that shouldn't trigger any errors when working with S3.

Another problem is that such a job runs for 4-5 hours before failing. Is this expected behavior? The CSV files have around 30-40 columns.

I don't know which direction to go. Can Glue handle files this large at all?
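For reference, a Glue job of the kind described here (inferred from the pyWriteDynamicFrame call in the stack trace, not taken from the asker's actual script) typically looks roughly like the sketch below; the S3 paths are placeholders.

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the gzipped CSV files straight from S3 (path is a placeholder).
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input-csv-gz/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the same data back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output-parquet/"},
    format="parquet",
)

job.commit()
```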


Answer 1:


I think the problem isn't directly connected to the number of DPUs. You have a large file and you are using the GZIP format, which is not splittable, and that is why you have this problem.

I suggest converting your files from GZIP to bzip2 or lz4. Additionally, you should consider partitioning the output data for better performance in the future. A sketch of this follows the link below.

http://comphadoop.weebly.com/
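A minimal sketch of that suggestion, assuming the input has been re-compressed with a splittable codec (e.g. .bz2) and that plain Spark DataFrames are acceptable inside the Glue job; the paths and the partition column ("year") are placeholders.

```python
from pyspark.sql import SparkSession

# Inside a Glue job you would normally reuse glueContext.spark_session instead.
spark = SparkSession.builder.getOrCreate()

# With a splittable codec (.bz2), Spark can split the read across many tasks
# instead of forcing a single executor to decompress a 30-40 GB file on its own.
df = spark.read.csv("s3://my-bucket/input-csv-bz2/", header=True, inferSchema=True)

(
    df.repartition(200)          # spread the work over many tasks/output files
      .write
      .mode("overwrite")
      .partitionBy("year")       # partition the output for faster reads later
      .parquet("s3://my-bucket/output-parquet/")
)
```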




Answer 2:


How many DPUs are you using? This article gives a nice overview of DPU capacity planning. Hope that helps. There is no definite rulebook from AWS stating how many DPUs you need to process a particular data size.
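If you do decide to scale the job up, a hedged boto3 sketch of how capacity is set when starting a job run (the job name and numbers are placeholders, not a sizing recommendation):

```python
import boto3

glue = boto3.client("glue")

# Newer Glue Spark jobs use WorkerType/NumberOfWorkers; older ones use MaxCapacity (DPUs).
response = glue.start_job_run(
    JobName="csv-gz-to-parquet",   # placeholder job name
    WorkerType="G.2X",             # more memory per worker than G.1X
    NumberOfWorkers=10,
)
print(response["JobRunId"])
```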



Source: https://stackoverflow.com/questions/52614265/using-aws-glue-to-convert-very-big-csv-gz-30-40-gb-each-to-parquet
