What are the differences between kappa-architecture and lambda-architecture

Question

If the Kappa architecture runs its analysis directly on the stream instead of splitting the data into two streams, where is the data stored then? In a messaging system like Kafka? Or can it be in a database for recomputing?

And is a separate batch layer faster than recomputing with a stream processing engine for batch analytics?

Answer 1:

"A very simple case to consider is when the algorithms applied to the real-time data and to the historical data are identical. Then it is clearly very beneficial to use the same code base to process historical and real-time data, and therefore to implement the use-case using the Kappa architecture".

"Now, the algorithms used to process historical data and real-time data are not always identical. In some cases, the batch algorithm can be optimized thanks to the fact that it has access to the complete historical dataset, and then outperform the implementation of the real-time algorithm. Here, choosing between Lambda and Kappa becomes a choice between favoring batch execution performance over code base simplicity".

"Finally, there are even more complex use-cases, in which even the outputs of the real-time and batch algorithm are different. For example, a machine learning application where generation of the batch model requires so much time and resources that the best result achievable in real-time is computing and approximated updates of that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used".

Quote

Lambda:
Separate batch and stream layers
Higher code complexity
Faster performance with separate batch and stream layers
Better when batch and stream need different algorithms
Cheaper to use plain data storage for batch computing instead of a database

Kappa:
Only a stream processing layer
Easier to maintain, lower complexity, a single algorithm for batch and stream
With very large datasets, recomputing from a database for batch analytics gets expensive
With very large datasets, recomputing from a database or from Kafka for batch analytics gets slower
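
To make the comparison concrete, here is a minimal in-memory sketch of the two shapes. Everything in it is an assumption for illustration: the `log` list stands in for a retained Kafka topic, and `stream_update`, `run_stream` and `run_batch` are hypothetical names, not part of any framework.

```python
from collections import Counter
from itertools import groupby

# Hypothetical retained event log, standing in for a Kafka topic.
# Each event is (user, amount).
log = [("alice", 5), ("bob", 3), ("alice", 2), ("bob", 4)]


def stream_update(view, event):
    """Streaming logic: fold one event into the running view."""
    user, amount = event
    view[user] += amount


def run_stream(events):
    """Process events one at a time, as a stream processor would."""
    view = Counter()
    for event in events:
        stream_update(view, event)
    return view


def run_batch(events):
    """Separate batch logic: it sees the whole dataset, so it can sort and group."""
    ordered = sorted(events, key=lambda e: e[0])
    return Counter({
        user: sum(amount for _, amount in group)
        for user, group in groupby(ordered, key=lambda e: e[0])
    })


# Kappa: one code path. Recomputing means replaying the log from the start.
kappa_view = run_stream(log)

# Lambda: two code paths. The batch layer covers historical events, the
# speed layer covers events the last batch run has not seen yet, and the
# two views are merged at query time.
batch_view = run_batch(log[:3])
speed_view = run_stream(log[3:])
lambda_view = batch_view + speed_view

print(kappa_view == lambda_view)
```

In this toy case both approaches give the same totals. The difference is that Lambda has two implementations to keep in sync, while Kappa pays for simplicity by replaying the full log whenever it recomputes.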

Answer 2:

You may also like to read the original article discussing the two here

Quoting the original blog post

"The efficiency and resource trade-offs between the two approaches are somewhat of a wash. The Lambda Architecture requires running both reprocessing and live processing all the time, whereas what I have proposed only requires running the second copy of the job when you need reprocessing. However, my proposal requires temporarily having 2x the storage space in the output database and requires a database that supports high-volume writes for the re-load. In both cases, the extra load of the reprocessing would likely average out. If you had many such jobs, they wouldn’t all reprocess at once, so on a shared cluster with several dozen such jobs you might budget an extra few percent of capacity for the few jobs that would be actively reprocessing at any given time.

The real advantage isn’t about efficiency at all, but rather about allowing people to develop, test, debug, and operate their systems on top of a single processing framework. So, in cases where simplicity is important, consider this approach as an alternative to the Lambda Architecture."
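
The reprocessing flow described in the quote can be sketched roughly like this. It is only a toy model under stated assumptions: `log` stands in for a retained Kafka topic, `tables` for the output database, and `job_v1`, `job_v2` and `run_job` are hypothetical names. The switch-over and clean-up steps at the end are my reading of the approach, not something spelled out in the quoted passage.

```python
# Hypothetical retained log (e.g. a Kafka topic with long retention).
log = [
    {"key": "a", "value": 1},
    {"key": "b", "value": 2},
    {"key": "a", "value": 3},
]

tables = {}  # stand-in for the output database: table name -> {key: value}


def job_v1(event):
    return event["value"]


def job_v2(event):
    # Hypothetical changed logic that motivates reprocessing.
    return event["value"] * 10


def run_job(job, table_name, from_offset=0):
    """Run a job over the log from from_offset, writing to its own table."""
    table = tables.setdefault(table_name, {})
    for event in log[from_offset:]:
        table[event["key"]] = job(event)
    return len(log)  # next offset to read


# Normal operation: one job, one output table, readers point at it.
run_job(job_v1, "output_v1")
serving = "output_v1"

# Reprocessing: start a second copy of the job from offset 0 into a new table.
# Both tables exist at once, which is the temporary 2x storage in the quote.
caught_up_to = run_job(job_v2, "output_v2")
peak_tables = len(tables)

# Once the new job has caught up, switch readers and drop the old table.
if caught_up_to == len(log):
    serving = "output_v2"
    del tables["output_v1"]
```

The second job only runs while reprocessing is needed, so the extra cost is the temporary second table plus the write load of the re-load, rather than a batch layer that runs all the time.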

Source: https://stackoverflow.com/questions/41967295/what-are-the-differences-between-kappa-architecture-and-lambda-architecture

Tags

apache-kafka

batch-processing

stream-processing

lambda-architecture

bigdata