Using AWS Glue to convert very big csv.gz (30-40 GB each) to Parquet


I think the problem isn't directly connected to the number of DPUs. You have large files, and you are using the GZIP format, which is not splittable; that is why you are having this problem.

I suggest converting your files from GZIP to bzip2 or LZ4. Additionally, you should consider partitioning the output data for better performance in the future.

This overview of Hadoop compression codecs explains which formats are splittable: http://comphadoop.weebly.com/

How many DPUs are you using? This article gives a nice overview of DPU capacity planning. Hope that helps. There is no definitive rulebook from AWS stating how many DPUs you need to process data of a particular size.
