How do I merge multiple Parquet files into a single Parquet file using a Linux or HDFS command?

Submitted by 心已入冬 on 2019-12-03 00:58:22
giaosudau

According to https://issues.apache.org/jira/browse/PARQUET-460, you can now download the source code and compile parquet-tools, which has a built-in merge command.

java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_dir/file_name

Or use a tool like https://github.com/stripe/herringbone

You can also do it with HiveQL itself, if your execution engine is MapReduce.

You can set a flag for your query that causes Hive to merge small files at the end of the job:

SET hive.merge.mapredfiles=true;

or

SET hive.merge.mapfiles=true;

if your job is a map-only job.

This will cause the Hive job to automatically merge many small Parquet files into fewer big files. You can control the number of output files by adjusting the hive.merge.size.per.task setting. If you want just one file, make sure you set it to a value that is always larger than the size of your output. Also, make sure to adjust hive.merge.smallfiles.avgsize accordingly: a merge is triggered when the average output file size falls below this threshold, so set it to a very high value if you want to make sure Hive always merges files. You can read more about these settings in the Hive documentation.
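Putting the settings together, a query along these lines would rewrite many small Parquet files into fewer large ones. This is a sketch: the table names are hypothetical, and the size values are example choices, not recommendations.

```sql
-- Enable merging of small output files (as described above).
SET hive.merge.mapfiles=true;                 -- for map-only jobs
SET hive.merge.mapredfiles=true;              -- for map-reduce jobs
SET hive.merge.size.per.task=256000000;       -- target size of each merged file, in bytes
SET hive.merge.smallfiles.avgsize=256000000;  -- merge whenever the average output file is below this

-- 'merged_table' and 'small_files_table' are hypothetical names;
-- the INSERT OVERWRITE rewrites the data, and Hive merges the outputs.
INSERT OVERWRITE TABLE merged_table
SELECT * FROM small_files_table;
```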
