Apache Beam (Batch + strEAM) is a model and set of APIs for doing both batch and streaming data processing. It was open-sourced by Google (with Cloudera and PayPal) in 2016 via an Apache incubator project.
The page "Dataflow/Beam & Spark: A Programming Model Comparison" on the Cloud Dataflow site contrasts the Beam API with Apache Spark, which has been hugely successful at bringing a modern, flexible API and a set of optimization techniques for both batch and streaming to the Hadoop world and beyond.
Beam tries to take all that a step further with a model that makes it easy to describe the various aspects of out-of-order processing, which is often an issue when combining batch and streaming, as described in that Programming Model Comparison.
In particular, to quote from the comparison, the Dataflow model is designed to address, elegantly and in a way that is more modular, robust and easier to maintain, "the four critical questions all data processing practitioners must attempt to answer when building their pipelines":
- What results are calculated? Sums, joins, histograms, machine learning models?
- Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated in fixed windows, sessions, or a single global window?
- When in processing time are results materialized? Does the time each event is observed within the system affect results? When are results emitted? Speculatively, as data evolve? When data arrive late and results must be revised? Some combination of these?
- How do refinements of results relate? If additional data arrive and results change, are they independent and distinct, do they build upon one another, etc.?
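The four questions can be illustrated with a small sketch. This is plain Python, not the Beam API: the names `window_start`, `run`, and `WINDOW_SIZE`, the 60-second window, and the sample events are all hypothetical and chosen only for illustration. It sums values (what), groups them into fixed windows by event time (where), emits a result each time an element arrives (when), and either accumulates or discards earlier results (how).

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed-window width


def window_start(event_time, size=WINDOW_SIZE):
    """Where: map an event time to the start of its fixed window."""
    return event_time - (event_time % size)


def run(events, accumulate=True, size=WINDOW_SIZE):
    """Process (event_time, value) pairs in arrival order.

    What: a per-window sum.
    When: one result is emitted for every arriving element.
    How: if accumulate is True, each result is the running window total;
         otherwise each result covers only the newly arrived element.
    """
    totals = defaultdict(int)
    results = []
    for event_time, value in events:
        start = window_start(event_time, size)
        if accumulate:
            totals[start] += value
            results.append((start, totals[start]))
        else:
            results.append((start, value))
    return results


# Arrival order differs from event-time order: the event at t=5 arrives last,
# after events that belong to a later window.
events = [(10, 1), (70, 2), (65, 3), (5, 4)]

accumulating = run(events, accumulate=True)
discarding = run(events, accumulate=False)

print(accumulating)  # [(0, 1), (60, 2), (60, 5), (0, 5)]
print(discarding)    # [(0, 1), (60, 2), (60, 3), (0, 4)]
```

Note how the late event at t=5 still lands in the window starting at 0, because windows are assigned by event time rather than arrival time. In accumulating mode the revised result for that window replaces the earlier one; in discarding mode a consumer has to add the partial results itself.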
The pipelines described in Beam can in turn be run on Spark, Flink, Google's Dataflow offering in the cloud, and other "runtimes" (Beam calls them runners), including a "Direct" option that runs on the local machine.
A variety of languages are supported by the architecture. The Java SDK is available now. A Dataflow Python SDK is nearing release, and others are envisioned, for Scala among other languages.
See the source in the "Mirror of Apache Beam" repository.