How to combine small parquet files to one large parquet file? [duplicate]

Submitted by 半城伤御伤魂 on 2019-12-23 05:31:22

Question


I have some partitioned Hive tables that point to parquet files. Each partition now contains a lot of small parquet files, each around 5 KB, and I want to merge those small files into one large file per partition. How can I achieve this to improve my Hive performance? I have tried reading all the parquet files in a partition into a PySpark dataframe, rewriting the combined dataframe to the same partition, and deleting the old files. But this seems inefficient and rather naive to me. What are the pros and cons of doing it this way? And if there are other ways, please guide me on how to achieve it in Spark or PySpark.
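For reference, a minimal sketch of the per-partition rewrite approach described above (the partition path and staging path are placeholders, and the final swap of the old small files for the new one is not shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path to one Hive partition's small parquet files
partition_path = '/warehouse/my_table/dt=2019-12-01'

# Read every small parquet file in that partition into one dataframe
df = spark.read.parquet(partition_path)

# Collapse to a single output file and write it to a staging location;
# the old small files would then be replaced by this single file
df.coalesce(1).write.mode('overwrite').parquet('/tmp/my_table_staging/dt=2019-12-01')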


Answer 1:


You can read the whole dataset, repartition it by the partition columns you have, and then write it out using partitionBy (this is also how you should save the data in the future). Something like:

spark\
    .read\
    .parquet('...')\
    .repartition('key1', 'key2',...)\
    .write\
    .partitionBy('key1', 'key2',...)\
    .option('path', target_part)\
    .saveAsTable('partitioned')
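
As a concrete usage sketch (the table name, paths, and partition column 'dt' below are placeholders; the maxRecordsPerFile option is only needed if a single file per partition would grow too large):

spark\
    .read\
    .parquet('/warehouse/my_table')\
    .repartition('dt')\
    .write\
    .mode('overwrite')\
    .partitionBy('dt')\
    .option('path', '/warehouse/my_table_compacted')\
    .option('maxRecordsPerFile', 1000000)\
    .saveAsTable('my_table_compacted')

Because the data is repartitioned on the partition column, all rows for a given 'dt' value land in the same task, so each Hive partition is written out as a single parquet file (up to the maxRecordsPerFile cap).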


Source: https://stackoverflow.com/questions/51874225/how-to-combine-small-parquet-files-to-one-large-parquet-file
