How to deal with a large number of parquet files


Question


I'm using Apache Parquet on Hadoop and after a while I have one concern. When I generate Parquet files in Spark on Hadoop, things can get pretty messy. By messy I mean that the Spark job generates a large number of Parquet files. When I try to query them, the query takes a long time because Spark has to merge all the files together.

Can you show me the right way to deal with this, or am I perhaps misusing Parquet? Have you already dealt with this, and how did you resolve it?

UPDATE 1: Would some "side job" that merges those files into a single Parquet file be good enough? What file size is preferred for Parquet files; are there any upper and lower bounds?
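
For context, the kind of "side job" I have in mind would look roughly like the following Spark sketch (the paths and the partition count are placeholders, not values from my actual setup):

```scala
import org.apache.spark.sql.SparkSession

object CompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-parquet")
      .getOrCreate()

    // Hypothetical paths; adjust to your own layout.
    val inputPath  = "hdfs:///data/events/2017-07-12"
    val outputPath = "hdfs:///data/events_compacted/2017-07-12"

    // Read the many small parquet files produced by the original job.
    val df = spark.read.parquet(inputPath)

    // Reduce the number of partitions so each output file is reasonably large.
    // The count 8 is just a placeholder; see the answers below for sizing guidance.
    df.coalesce(8)
      .write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}
```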


Answer 1:


Take a look at this GitHub repo and this answer. In short, keep the file size larger than the HDFS block size (128 MB or 256 MB).
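
As a rough illustration of that rule of thumb (the paths and the 4 GB size estimate below are assumptions, not something taken from the linked repo or answer), you can pick the number of output partitions so that each written file lands near one HDFS block:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("size-parquet-files").getOrCreate()

// Hypothetical input path; substitute your own.
val df = spark.read.parquet("hdfs:///data/raw")

val targetFileBytes = 128L * 1024 * 1024            // ~ one HDFS block (128 MB)
val estimatedBytes  = 4L * 1024 * 1024 * 1024       // assumed total dataset size (~4 GB)
val numFiles        = math.max(1L, estimatedBytes / targetFileBytes).toInt

// Write roughly block-sized parquet files.
df.repartition(numFiles)
  .write
  .mode("overwrite")
  .parquet("hdfs:///data/compact")                  // hypothetical output path
```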




Answer 2:


A good way to reduce the number of output files is to use coalesce or repartition.
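
A minimal sketch, assuming a DataFrame `df` has already been loaded and with placeholder output paths:

```scala
// coalesce: merges existing partitions without a full shuffle (cheaper),
// but the resulting files may be unevenly sized.
df.coalesce(16)
  .write
  .mode("overwrite")
  .parquet("hdfs:///data/out_coalesced")             // hypothetical output path

// repartition: performs a full shuffle, so it costs more,
// but the output files end up roughly the same size.
df.repartition(16)
  .write
  .mode("overwrite")
  .parquet("hdfs:///data/out_repartitioned")         // hypothetical output path
```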



Source: https://stackoverflow.com/questions/45058368/how-to-deal-with-large-number-of-parquet-files
