Collecting Parquet data from HDFS to local file system

Posted by 梦想的初衷 on 2019-12-11 04:03:31

Question


Given a Parquet dataset distributed on HDFS (a metadata file plus many .parquet parts), how do I correctly merge the parts and collect the data onto the local file system? dfs -getmerge ... doesn't work: it simply concatenates the metadata file with the actual Parquet parts.


Answer 1:


One way involves the Apache Spark APIs. It solves the problem, though a more efficient method that avoids third-party tools may exist.

spark> val parquetData = sqlContext.parquetFile("pathToMultipartParquetHDFS")
spark> parquetData.repartition(1).saveAsParquetFile("pathToSinglePartParquetHDFS")

bash> ../bin/hadoop dfs -get pathToSinglePartParquetHDFS localPath

Since Spark 1.4 it is better to use DataFrame::coalesce(1) instead of DataFrame::repartition(1), because coalescing down to a single partition avoids a full shuffle.
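
A minimal sketch of that variant, assuming the same Spark shell session and placeholder paths as above (sqlContext.read.parquet and DataFrameWriter.parquet are the Spark 1.4+ equivalents of parquetFile and saveAsParquetFile):

spark> // read all parts of the multipart Parquet dataset from HDFS
spark> val parquetData = sqlContext.read.parquet("pathToMultipartParquetHDFS")
spark> // coalesce to a single partition and write it back to HDFS as one part
spark> parquetData.coalesce(1).write.parquet("pathToSinglePartParquetHDFS")

bash> # copy the single-part result down to the local file system
bash> hadoop fs -get pathToSinglePartParquetHDFS localPath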




Answer 2:


You can use Pig; note that PigStorage writes the result out as delimited text rather than Parquet:

A = LOAD '/path/to/parquet/files' USING parquet.pig.ParquetLoader AS (x, y, z);
STORE A INTO '/path/to/output' USING PigStorage('|');

Alternatively, you can create an Impala table on top of the Parquet data and then export query results with

impala-shell -e "query" -o <output>

In the same way, you could also use MapReduce.




Answer 3:


You can use parquet-tools:

java -jar parquet-tools.jar merge source/ target/



Source: https://stackoverflow.com/questions/31108123/collecting-parquet-data-from-hdfs-to-local-file-system
