How can I configure spark so that it creates “_$folder$” entries in S3?

Submitted by 青春壹個敷衍的年華 on 2021-02-10 14:39:47

Question


When I write my dataframe to S3 using

df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour", "gen", "client")
  .option("compression", "gzip")
  .save("s3://xxxx/yyyy")

I get the following in S3

year=2018
year=2019

but I would like to have this instead:

year=2018
year=2018_$folder$
year=2019
year=2019_$folder$

The scripts reading from that S3 location depend on the *_$folder$ entries, but I haven't found a way to configure Spark/Hadoop to generate them.

Any idea which Hadoop or Spark configuration setting controls the generation of *_$folder$ files?


Answer 1:


Those markers are a legacy feature; I don't think anything creates them any more... though they are generally ignored when actually listing directories (that is, even if they are present, they get stripped from listings and replaced with directory entries).
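If your downstream scripts really do require those markers, one possible workaround (not part of the original answer, only a sketch) is to create the zero-byte "_$folder$" objects yourself after the write, using the Hadoop FileSystem API. The bucket/prefix below are placeholders, and the s3a:// scheme is assumed:

import org.apache.hadoop.fs.{FileSystem, Path}

val basePath = new Path("s3a://xxxx/yyyy")  // placeholder bucket/prefix
val fs = FileSystem.get(basePath.toUri, spark.sparkContext.hadoopConfiguration)

// For every partition directory under the base path, write an empty
// "<dirname>_$folder$" object next to it, mimicking the legacy marker layout.
def addFolderMarkers(dir: Path): Unit = {
  fs.listStatus(dir).filter(_.isDirectory).foreach { status =>
    val marker = new Path(status.getPath.getParent, status.getPath.getName + "_$folder$")
    if (!fs.exists(marker)) fs.create(marker, true).close()  // zero-byte marker object
    addFolderMarkers(status.getPath)                         // recurse into nested partitions
  }
}

addFolderMarkers(basePath)

The older S3 connectors used exactly such empty objects to simulate directories, which is why modern listings tend to hide them; creating them manually keeps legacy consumers happy without affecting the Parquet data itself.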



Source: https://stackoverflow.com/questions/55693083/how-can-i-configure-spark-so-that-it-creates-folder-entries-in-s3
