Spark: PartitionBy, change output file name

Submitted by 耗尽温柔 on 2019-12-11 07:26:38

Question


Currently, when I use partitionBy to write to HDFS: DF.write.partitionBy("id")

I get an output structure like the following (which is the default behaviour):

../id=1/

../id=2/

../id=3/

I would like a structure that looks like:

../a/

../b/

../c/

such that

if id = 1, then a
if id = 2, then b

... and so on.

Is there a way to change the output file names? If not, what is the best way to do this?


Answer 1:


You won't be able to use Spark's partitionBy to achieve this.

Instead, you have to break your DataFrame into its component partitions, and save them one by one, like so:

# Map id=1 -> 'a', id=2 -> 'b', and so on, writing each subset to its own path.
base = ord('a') - 1
for id in range(1, 4):
    DF.filter(DF['id'] == id).write.save("..." + chr(base + id))

Alternatively, you can write the entire DataFrame using Spark's partitionBy facility and then rename the partition directories yourself using the HDFS APIs.
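As a rough sketch of that second approach, assuming a SparkSession named spark and a hypothetical output directory /data/out (both the path and the 1-3 id range are placeholders, not part of the original question):

# Write normally with partitionBy, producing /data/out/id=1, /data/out/id=2, ...
DF.write.partitionBy("id").parquet("/data/out")

# Then rename each id=N directory to a single letter through the Hadoop
# FileSystem API, reached via PySpark's JVM gateway.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

base = ord('a') - 1
for id in range(1, 4):
    src = Path("/data/out/id=%d" % id)
    dst = Path("/data/out/%s" % chr(base + id))
    if fs.exists(src):
        fs.rename(src, dst)

Note that _jvm and _jsc are internal PySpark handles rather than a stable public API, so treat this as a workaround.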



来源:https://stackoverflow.com/questions/45154696/spark-partitionby-change-output-file-name
