How to change the location of _spark_metadata directory?

Submitted by 风流意气都作罢 on 2019-12-01 08:49:12

My understanding is that this is not possible, at least as of Spark 2.3:

  1. The name of the metadata directory is always _spark_metadata

  2. The _spark_metadata directory is always created at the location the path option points to
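In other words, the metadata location is fully determined by the path option. A minimal sketch of that rule (plain Python for illustration; the directory name is a hard-coded constant in Spark, everything else here is hypothetical naming):

```python
import posixpath

# Hard-coded in Spark's FileStreamSink; there is no option to override it.
SPARK_METADATA_DIR = "_spark_metadata"

def metadata_location(path_option: str) -> str:
    """Where Spark places the metadata log for a given 'path' option."""
    return posixpath.join(path_option, SPARK_METADATA_DIR)

print(metadata_location("/data/output"))
# -> /data/output/_spark_metadata
```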

I think the only way to "fix" it is to report an issue in Apache Spark's JIRA and hope someone picks it up.

Internals

The flow is that DataSource is requested to create the sink of a streaming query, takes the path option, and creates a FileStreamSink with it. The path option simply becomes the basePath where both the results and the metadata are written.
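That flow can be modeled roughly like this (illustrative Python, not Spark's actual Scala code; the class and field names merely follow the description above):

```python
import posixpath

class FileStreamSink:
    # The metadata directory name is a fixed constant on the sink.
    METADATA_DIR = "_spark_metadata"

    def __init__(self, base_path: str):
        # The 'path' option becomes basePath for both results and metadata.
        self.base_path = base_path
        self.log_path = posixpath.join(base_path, self.METADATA_DIR)

class DataSource:
    def __init__(self, options: dict):
        self.options = options

    def create_sink(self) -> FileStreamSink:
        # DataSource reads the 'path' option and hands it to the sink;
        # nothing along the way lets you redirect the metadata elsewhere.
        return FileStreamSink(self.options["path"])

sink = DataSource({"path": "/data/output"}).create_sink()
print(sink.log_path)
# -> /data/output/_spark_metadata
```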

You can find the initial commit quite useful to understand the purpose of the metadata directory.

In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
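The mechanism that commit message describes can be sketched as follows (a simplified Python model on a local filesystem; Spark's real implementation lives in its file-sink log classes and works against Hadoop filesystem APIs, so all names here are illustrative):

```python
import json
import os
import tempfile

METADATA_DIR = "_spark_metadata"

def commit_batch(batch_id: int, records: list, output_dir: str) -> None:
    """Write a batch's files, then atomically record them in the metadata log."""
    log_dir = os.path.join(output_dir, METADATA_DIR)
    os.makedirs(log_dir, exist_ok=True)

    # 1. Write the batch's data to a unique, batch-specific file.
    data_file = os.path.join(output_dir, f"part-{batch_id:05d}.json")
    with open(data_file, "w") as f:
        json.dump(records, f)

    # 2. Atomically publish the batch: write the log entry to a temp file,
    #    then rename it into place. A reader either sees the complete entry
    #    or no entry at all, never a half-written batch.
    log_entry = os.path.join(log_dir, str(batch_id))
    fd, tmp = tempfile.mkstemp(dir=log_dir)
    with os.fdopen(fd, "w") as f:
        json.dump([data_file], f)
    os.replace(tmp, log_entry)  # atomic rename on POSIX

def committed_files(output_dir: str) -> list:
    """Reader side: trust the metadata log instead of listing data files."""
    log_dir = os.path.join(output_dir, METADATA_DIR)
    files = []
    for entry in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, entry)) as f:
            files.extend(json.load(f))
    return files
```

This is why readers check for the log directory first: files left behind by a failed batch are simply invisible, which is what gives the sink its exactly-once behavior.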
