Writing RDD partitions to individual parquet files in its own directory

Posted by 强颜欢笑 on 2019-12-18 03:36:09

Question


I am struggling with a step where I want to write each RDD partition to a separate parquet file in its own directory. An example would be:

    <root>
        <entity=entity1>
            <year=2015>
                <week=45>
                    data_file.parquet

The advantage of this format is that I can use it directly in SparkSQL as columns, and I won't have to repeat the data in the actual files. It would also be a good way to get to a specific partition without storing separate partitioning metadata somewhere else.
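For context, here is roughly what that layout buys at read time. This is a minimal sketch using the SparkSession API; the HDFS URL and the column values come from the example above, and the session setup is illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("read-partitioned").getOrCreate()
    import spark.implicits._

    // entity/year/week are discovered from the directory names and exposed
    // as ordinary columns; the filter prunes down to a single leaf directory.
    val df = spark.read.parquet("hdfs://localhost:9000/parquet_data")
    df.filter($"entity" === "entity1" && $"year" === 2015 && $"week" === 45).show()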

As a preceding step, I have all the data loaded from a large number of gzip files and partitioned based on the above key.

A possible approach would be to get each partition as a separate RDD and then write it out, though I couldn't find a good way of doing that.

Any help would be appreciated. By the way, I am new to this stack.


Answer 1:


I think it's possible to do this by calling foreachPartition(f: Iterator[T] => Unit) on the RDD you want to save.

In the function you pass to foreachPartition (a sketch follows the list):

  1. Prepare the path, e.g. hdfs://localhost:9000/parquet_data/year=x/week=y
  2. Create a ParquetWriter
  3. Exhaust the Iterator by writing each record to that ParquetWriter
  4. Clean up (close the writer)
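Here is a minimal sketch of those four steps using parquet-avro's AvroParquetWriter. Treat it as illustrative only: the Event case class, its value field, the HDFS URL, and the assumption that every record in a partition shares the same entity/year/week key are all assumptions, not part of the original answer.

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetWriter
    import org.apache.parquet.hadoop.ParquetWriter

    // Hypothetical record type; substitute your own.
    case class Event(entity: String, year: Int, week: Int, value: String)

    // Illustrative Avro schema for the single data column we write.
    val schemaJson =
      """{"type":"record","name":"Event","fields":[{"name":"value","type":"string"}]}"""

    rdd.foreachPartition { iter =>          // rdd: RDD[Event], already partitioned by key
      val rows = iter.toList
      rows.headOption.foreach { first =>
        // 1. Build the directory path from this partition's key.
        val dir = s"hdfs://localhost:9000/parquet_data/" +
          s"entity=${first.entity}/year=${first.year}/week=${first.week}"
        val schema = new Schema.Parser().parse(schemaJson)

        // 2. Create a ParquetWriter for a file inside that directory.
        val writer: ParquetWriter[GenericRecord] =
          AvroParquetWriter.builder[GenericRecord](new Path(s"$dir/data_file.parquet"))
            .withSchema(schema)
            .build()

        // 3. Exhaust the partition's rows, writing one record at a time.
        try rows.foreach { e =>
          val rec = new GenericData.Record(schema)
          rec.put("value", e.value)
          writer.write(rec)
        } finally writer.close()            // 4. Clean up.
      }
    }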



Answer 2:


I don't think the accepted answer appropriately answers the question.

Try something like this:

    df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

And you will get the partitioned directory structure.
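Adapted to the question's layout, a sketch might look like the following; the column names follow the entity/year/week example, the session setup is boilerplate, and the Event type is the same hypothetical one as above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
    import spark.implicits._

    case class Event(entity: String, year: Int, week: Int, value: String)

    val df = rdd.toDF()                     // rdd: RDD[Event]
    df.write
      .partitionBy("entity", "year", "week")
      .parquet("hdfs://localhost:9000/parquet_data")
    // Produces .../entity=entity1/year=2015/week=45/part-*.parquet

Note that Spark chooses the part-file names inside each leaf directory, so you get part-* files rather than a hand-picked data_file.parquet, but the directory structure matches the question.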



Source: https://stackoverflow.com/questions/30338213/writing-rdd-partitions-to-individual-parquet-files-in-its-own-directory
