Question
Currently, when I use partitionBy to write to HDFS: DF.write.partitionBy("id")
I get an output structure looking like this (which is the default behaviour):
../id=1/
../id=2/
../id=3/
I would like a structure looking like:
../a/
../b/
../c/
such that
if id = 1, then a
if id = 2, then b
.. etc
Is there a way to change the output file name? If not, what is the best way to do this?
Answer 1:
You won't be able to use Spark's partitionBy to achieve this.
Instead, you have to break your DataFrame into its component partitions and save them one by one, like so:
base = ord('a') - 1
for id in range(1, 4):
    # select the rows for this id and write them to a directory named a, b or c
    DF.filter(DF['id'] == id).write.save("..." + chr(base + id))
Alternatively, you can write the entire DataFrame using Spark's partitionBy facility, and then manually rename the partition directories using the HDFS APIs, as sketched below.
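Here is a minimal sketch of that second approach, assuming a hypothetical output path /data/out, Parquet output, an active SparkSession named spark, and access to the Hadoop FileSystem API through PySpark's JVM gateway; none of these specifics come from the original answer.

# Step 1: write the whole DataFrame with partitionBy as usual
# (the path and output format here are placeholders)
DF.write.partitionBy("id").parquet("/data/out")

# Step 2: rename each id=N directory via the Hadoop FileSystem API,
# reached through the SparkContext's JVM gateway
sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

base = ord('a') - 1
for id in range(1, 4):
    src = Path("/data/out/id=%d" % id)
    dst = Path("/data/out/%s" % chr(base + id))
    if fs.exists(src):
        fs.rename(src, dst)  # moves the partition directory to its new name

Keep in mind that once the directories are renamed, downstream readers will no longer see them as id= partitions, so partition pruning on the id column is lost.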
Source: https://stackoverflow.com/questions/45154696/spark-partitionby-change-output-file-name