What is the difference between mini-batch and real-time streaming in practice (not theory)?

Submitted by 只谈情不闲聊 on 2019-12-02 23:31:28
Fabian Hueske

Disclaimer: I'm a committer and PMC member of Apache Flink. I'm familiar with the overall design of Spark Streaming but do not know its internals in detail.

The mini-batch stream processing model as implemented by Spark Streaming works as follows:

  • Records of a stream are collected in a buffer (mini-batch).
  • Periodically, the collected records are processed using a regular Spark job. This means, for each mini-batch a complete distributed batch processing job is scheduled and executed.
  • While the job runs, the records for the next batch are collected.
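
The loop above can be sketched in plain Python (a single-threaded toy model, not Spark's actual scheduler; `micro_batch_loop`, its parameters, and the interval values are all illustrative — in real Spark Streaming, collecting the next batch overlaps with processing the current one):

```python
import time

def micro_batch_loop(source, process, batch_interval=1.0):
    """Collect records into a buffer; when the interval elapses, run the
    batch job over the buffer and start collecting the next batch.

    source: an iterator of records; process: the per-batch "job".
    """
    buffer = []
    results = []
    deadline = time.monotonic() + batch_interval
    for record in source:
        buffer.append(record)               # collect into the mini-batch
        if time.monotonic() >= deadline:
            results.append(process(buffer))  # "schedule" a job for this batch
            buffer = []
            deadline = time.monotonic() + batch_interval
    if buffer:                               # flush the final partial batch
        results.append(process(buffer))
    return results
```

With `batch_interval=0` every record becomes its own batch (the degenerate case discussed below); with a large interval, all records land in a single batch.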

So, why is it not effective to run a mini-batch every 1 ms? Simply because this would mean scheduling a distributed batch job every millisecond. Even though Spark is very fast at scheduling jobs, this would be a bit too much. It would also significantly reduce the achievable throughput. Batching techniques used in OSs or in TCP also stop working well when their batches become too small.
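
A back-of-the-envelope illustration of why per-batch scheduling cost dominates at tiny intervals (the 10 ms overhead figure is an assumption for illustration, not a measured Spark number):

```python
def useful_fraction(batch_interval_ms, scheduling_overhead_ms):
    """Fraction of wall-clock time left for actual processing when every
    mini-batch pays a fixed job-scheduling cost."""
    return max(0.0, 1 - scheduling_overhead_ms / batch_interval_ms)

# Assumed fixed cost of 10 ms to schedule a distributed job:
for interval in (1, 10, 100, 1000):
    print(f"{interval:>5} ms batches -> {useful_fraction(interval, 10):.0%} useful time")
```

At a 1 ms interval the (assumed) scheduling cost alone exceeds the whole batch window, so throughput collapses; at 1 s it is negligible.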

I know that one answer was accepted, but I think one more thing must be said to answer this question fully. I think an answer like "Flink's real-time model is faster/better for streaming" is wrong, because it heavily depends on what you want to do.

The Spark mini-batch model has, as described in the previous answer, the disadvantage that a new job must be created for each mini-batch.

However, Spark Structured Streaming's default processing-time trigger is set to 0, which means new data is read as fast as possible. Concretely:

  1. one query starts
  2. data arrives, but the 1st query hasn't finished yet
  3. the 1st query ends, so the data is processed immediately.

Latency is very small in such cases.
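
The three steps above can be modeled with a toy virtual-clock simulation (plain Python, not Spark code; `simulate`, its parameters, and the chosen durations are all illustrative assumptions):

```python
import math

def simulate(arrivals, query_duration, trigger_interval=0.0):
    """Toy virtual-clock model of micro-batch triggers.

    trigger_interval=0.0 mimics Structured Streaming's default: the next
    query starts as soon as the previous one finishes and data is
    available. A positive value mimics a fixed processing-time trigger.
    Returns {arrival_time: end_to_end_latency} for each record.
    """
    latencies = {}
    clock = 0.0
    remaining = sorted(arrivals)
    while remaining:
        if remaining[0] > clock:     # no data yet: wait for the next arrival
            clock = remaining[0]
        if trigger_interval:         # fixed trigger: start on the next tick
            clock = math.ceil(clock / trigger_interval) * trigger_interval
        batch = [t for t in remaining if t <= clock]
        clock += query_duration      # the micro-batch query runs
        for t in batch:
            latencies[t] = clock - t  # a record is done when its query ends
        remaining = remaining[len(batch):]
    return latencies
```

With the default trigger, a record arriving mid-query only waits for the in-flight query to finish; with a long fixed trigger it also waits for the next tick, which is why the default keeps latency small.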

One big advantage over Flink is that Spark has unified APIs for batch and stream processing, precisely because of this mini-batch model. You can easily translate a batch job into a streaming job, or join streaming data with old data from a batch. Doing this with Flink is not possible. Flink also doesn't allow you to run interactive queries over the data you've received.
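
The idea behind a unified API can be illustrated in plain Python (an analogy only, not Spark's API: in Spark the same DataFrame operations run over `spark.read` and `spark.readStream`; here the stand-in is a transformation that accepts any iterable):

```python
from itertools import islice

def transform(records):
    """One transformation, written once: keep even records, double them.
    Because it accepts any iterable, the same logic serves a bounded
    batch (a list) and an unbounded stream (a generator)."""
    return (r * 2 for r in records if r % 2 == 0)

# Batch: a finite collection, fully materialized.
batch_result = list(transform([1, 2, 3, 4]))

# "Stream": an unbounded generator, consumed incrementally.
def numbers():
    n = 0
    while True:
        yield n
        n += 1

stream_result = list(islice(transform(numbers()), 3))
```

The same `transform` runs unchanged in both modes, which is the property the answer attributes to Spark's unified batch/streaming APIs.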

As said before, the use cases for micro-batches and real-time streaming are different:

  1. For very, very small latencies, Flink or a computational grid such as Apache Ignite will be a good fit. They are suitable for processing with very low latency, but not for very complex computations.
  2. For medium and larger latencies, Spark's more unified API allows you to do more complex computations in the same way batch jobs are done, precisely because of this unification.

For more details about Structured Streaming, please look at this blog post.

This is something I think about a lot, because the answer is always hard to formulate for both technical and non-technical people.

I will try to answer to this part:

Why is it not effective to run a mini-batch with 1 millisecond latency?

I believe the problem is not in the model itself but in how Spark implements it. There is empirical evidence that performance degrades when the mini-batch window is reduced too much. In fact, a batch interval of at least 0.5 seconds was suggested to prevent this kind of degradation. On big volumes even this window size was too small. I never had the chance to test it in production, but then I never had a strong real-time requirement either.

I know Flink better than Spark, so I don't really know Spark's internals that well, but I believe the overhead introduced by scheduling each batch job is irrelevant if a batch takes at least a few seconds to process; it becomes heavy when it imposes a fixed latency that you cannot go below. To understand the nature of these overheads, I think you would have to dig into the Spark documentation, code, and open issues.
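
The "fixed latency you can't go below" can be made concrete with a toy cost model (all components here are assumed round numbers, not measurements of Spark):

```python
def min_latency_ms(batch_interval_ms, scheduling_overhead_ms, processing_ms):
    """Toy lower bound on end-to-end latency in a micro-batch model.

    A record waits on average half a batch interval before its batch is
    triggered, then pays job-scheduling overhead plus actual processing.
    The interval and scheduling terms form the fixed floor: making the
    processing itself faster cannot remove them.
    """
    average_wait = batch_interval_ms / 2
    return average_wait + scheduling_overhead_ms + processing_ms

# Illustrative numbers: 500 ms interval, 50 ms scheduling, 100 ms processing.
floor = min_latency_ms(500, 50, 100)
```

Even with instant processing, the assumed 500 ms interval alone leaves a floor of a few hundred milliseconds, which is the kind of fixed latency the paragraph above refers to.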

The industry has now acknowledged that there is a need for a different model, and that's why many "streaming-first" engines are growing right now, with Flink as the front runner. I don't think it's just buzzwords and hype, even though the use cases for this kind of technology are, at least for now, extremely limited. Basically, if you need to take an automated decision in real time on big, complex data, you need a real-time fast-data engine. In any other case, including near-real-time, real-time streaming is overkill and mini-batch is fine.
