How does Structured Streaming ensure exactly-once writing semantics for file sinks?

情书的邮戳 2021-01-15 22:41

I am writing a storage writer for Spark Structured Streaming which will partition the given DataFrame and write it to a different blob store account. The Spark documentation sa…

2 Answers
  •  情书的邮戳
    2021-01-15 22:57

    When you use foreachBatch, Spark guarantees only that foreachBatch is called once per micro-batch in the normal case. If an exception is thrown while foreachBatch is executing, Spark will retry the same batch (with the same batchId). So if you write to multiple storages and the failure happens after some of the writes have already succeeded, the retry produces duplicates. You therefore have to handle failures during writing yourself, typically by making the writes idempotent, to avoid duplication.
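    As an illustration, here is a minimal Scala sketch of the usual idempotent-by-batchId pattern: each batch overwrites an output directory derived from batchId, so a retried batch replaces its own (possibly partial) output instead of appending duplicates. The source, storage accounts, containers, and paths below are placeholders, not anything from the question.

        import org.apache.spark.sql.{DataFrame, SparkSession}

        val spark = SparkSession.builder().appName("dual-blob-writer").getOrCreate()

        // Placeholder source; substitute the real streaming DataFrame.
        val events = spark.readStream.format("rate").load()

        val query = events.writeStream
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            // Cache so both writes see the same data without recomputing the source.
            batch.persist()
            // Embedding batchId in the path makes a retried batch overwrite its
            // own previous output rather than append duplicates.
            batch.write.mode("overwrite")
              .parquet(s"wasbs://data@account1.blob.core.windows.net/out/batch=$batchId")
            batch.write.mode("overwrite")
              .parquet(s"wasbs://data@account2.blob.core.windows.net/out/batch=$batchId")
            batch.unpersist()
          }
          .option("checkpointLocation", "/checkpoints/dual-blob-writer")
          .start()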

    In my practice, when I need to store to multiple storages I create a custom sink using the DataSource API v2, which supports commit: tasks stage their output, and the driver publishes it only once the whole epoch commits.
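    For reference, a minimal Scala skeleton of where that commit hook lives in the Spark 3.x connector (DataSource V2) API might look like the following. The class name, the StagedFiles message, and the publish/cleanup logic are hypothetical, and the per-task writer factory is omitted.

        import org.apache.spark.sql.connector.write.{PhysicalWriteInfo, WriterCommitMessage}
        import org.apache.spark.sql.connector.write.streaming.{StreamingDataWriterFactory, StreamingWrite}

        // Hypothetical commit message: the staged (not yet visible) files a task wrote.
        case class StagedFiles(paths: Seq[String]) extends WriterCommitMessage

        class DualBlobStreamingWrite extends StreamingWrite {

          // Factory creating per-task writers that stage output to a temporary
          // location and return StagedFiles; omitted here for brevity.
          override def createStreamingWriterFactory(info: PhysicalWriteInfo): StreamingDataWriterFactory = ???

          // Runs on the driver once per epoch, only after every task succeeded:
          // atomically publish the staged files, recording epochId so that a
          // replayed epoch becomes a no-op (this is where exactly-once comes from).
          override def commit(epochId: Long, messages: Array[WriterCommitMessage]): Unit = {
            // e.g. move each StagedFiles path to its final location, skipping
            // epochs already recorded as committed.
          }

          // Runs if any task fails: clean up staged output so a retry starts fresh.
          override def abort(epochId: Long, messages: Array[WriterCommitMessage]): Unit = {
            // e.g. delete every path carried by the StagedFiles messages.
          }
        }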
