Spark CSV 2.1 File Names

Submitted by 大城市里の小女人 on 2021-02-07 08:32:26

Question


I'm trying to save a DataFrame to CSV using the new Spark 2.1 csv writer:

 import org.apache.spark.sql.SaveMode

 df.select(myColumns: _*).write
   .mode(SaveMode.Overwrite)
   .option("header", "true")
   .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
   .csv(absolutePath)

Everything works fine, and I don't mind having the part-000XX prefix, but it now seems that a UUID has been added as a suffix.

i.e.
part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz ==> part-00032.csv.gz

Does anyone know how I can remove this suffix and keep only the part-000XX convention?

Thanks


Answer 1:


You can remove the UUID by overriding the configuration option "spark.sql.sources.writeJobUUID":

https://github.com/apache/spark/commit/0818fdec3733ec5c0a9caa48a9c0f2cd25f84d13#diff-c69b9e667e93b7e4693812cc72abb65fR75

Unfortunately this solution will not fully mirror the old saveAsTextFile style (i.e. part-00000), but it can make the output file name saner, e.g. part-00000-output.csv.gz, where "output" is the value you pass as spark.sql.sources.writeJobUUID. The "-" is appended automatically.
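
For illustration, here is a hedged sketch of how you might pass that value, assuming the property is picked up from the Hadoop configuration of the write job as the linked commit suggests (the property is internal, so whether it is honored can vary between Spark 2.x builds; df and absolutePath below are placeholders standing in for the question's values):

 import org.apache.spark.sql.{SaveMode, SparkSession}

 val spark = SparkSession.builder().appName("csv-file-names").getOrCreate()

 // Assumption: the internal property "spark.sql.sources.writeJobUUID" is read from
 // the Hadoop configuration during the write, so "output" replaces the random UUID.
 spark.sparkContext.hadoopConfiguration.set("spark.sql.sources.writeJobUUID", "output")

 // Placeholder DataFrame and output path standing in for the question's df and absolutePath.
 val df = spark.range(5).toDF("id")
 val absolutePath = "/tmp/spark-csv-output"

 df.write
   .mode(SaveMode.Overwrite)
   .option("header", "true")
   .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
   .csv(absolutePath)

 // If the property is honored, file names should look like part-00000-output.csv.gz
 // instead of part-00000-<random-uuid>.csv.gz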

SPARK-8406 is the relevant Spark issue and here's the actual Pull Request: https://github.com/apache/spark/pull/6864



Source: https://stackoverflow.com/questions/42870726/spark-csv-2-1-file-names
