Spark Direct Stream Kafka order of events

Submitted by 社会主义新天地 on 2019-12-11 23:46:08

Question


I have a question about reading data from Kafka 0.9 with Spark Direct Streaming (Spark 1.6) and saving it to HBase.

I am trying to update specific row keys in an HBase table as messages are received from Kafka, and I need to preserve the order of events (data received at t0 must be saved in HBase before data received at t1).

The row key is a UUID, which is also the key of the message in Kafka, so at the Kafka level I am sure that the events for a specific UUID are ordered within their partition.

My problem begins when I start reading using Spark.

With the direct stream approach, each executor will read from one partition. I am not shuffling the data (just parsing and saving), so my events won't get mixed up across the RDD, but I am worried that when the executor reads the partition it won't maintain the order, and I will end up with incorrect data in HBase when I save it.

How can I ensure that the order is kept at the executor level, especially if I use multiple cores in one executor (which, from my understanding, results in multiple threads)?

I think I can also live with 1 core if that fixes the issue, combined with turning off speculative execution, enabling Spark backpressure, and limiting the maximum retries on the executor to 1.
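For concreteness, the settings mentioned above could be collected roughly as follows. This is only a sketch: the mapping from each idea to a Spark configuration key is an assumption (in particular, "maximum retries" is taken to mean `spark.task.maxFailures`), and `ORDERING_CONF` and `to_submit_args` are hypothetical names used for illustration.

```python
# Assumed mapping of the ideas in the question to Spark configuration keys.
ORDERING_CONF = {
    "spark.executor.cores": "1",                     # one core per executor
    "spark.speculation": "false",                    # no speculative duplicate tasks
    "spark.streaming.backpressure.enabled": "true",  # backpressure on ingestion rate
    "spark.task.maxFailures": "1",                   # assumption: "max retries" means this key
}


def to_submit_args(conf):
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(ORDERING_CONF)))
```

Whether these settings are sufficient to guarantee ordering is exactly what the question is asking; the block only shows how they would be expressed.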

I have also thought about sorting the events at the Spark partition level using the Kafka offset.
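A rough sketch of that sorting idea, written in plain Python with no Spark dependency. It assumes each event carries its Kafka partition and offset alongside the key and value; the `Event` type, the `apply_in_offset_order` function, and the dict standing in for the HBase table are all hypothetical. In a real job the same logic would presumably run inside a per-partition function such as `foreachPartition`.

```python
from collections import namedtuple

# Hypothetical event shape: Kafka metadata plus the message key and value.
Event = namedtuple("Event", ["partition", "offset", "key", "value"])


def apply_in_offset_order(events, store):
    """Apply events to `store` in (partition, offset) order.

    `store` is a dict standing in for the HBase table: the last write
    for a key wins, so the highest offset for each key is what remains.
    """
    for event in sorted(events, key=lambda e: (e.partition, e.offset)):
        store[event.key] = event.value
    return store


if __name__ == "__main__":
    # Events arrive out of order within a single Kafka partition.
    batch = [
        Event(0, 2, "uuid-a", "third"),
        Event(0, 0, "uuid-a", "first"),
        Event(0, 1, "uuid-b", "second"),
    ]
    print(apply_in_offset_order(batch, {}))
```

Sorting only restores order within one batch; it does not by itself address ordering across batches or across task retries.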

Any advice?

Thanks a lot in advance!

Source: https://stackoverflow.com/questions/50275956/spark-direct-stream-kafka-order-of-events
