Apache Flink DataStream API doesn't have a mapPartition transformation

Question


Spark's DStream has a mapPartition API, while Flink's DataStream API doesn't. Could anyone help explain the reason? What I want to do is implement an API similar to Spark's reduceByKey on Flink.


Answer 1:


Flink's stream processing model is quite different from Spark Streaming, which is centered around mini-batches. In Spark Streaming, each mini-batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.

In Flink's DataSet API, a MapPartitionFunction has two parameters: an iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
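
For illustration, a DataSet-style MapPartitionFunction looks roughly like the sketch below (the CountPartition class and its counting logic are hypothetical, not from the question). It only works because a DataSet partition is finite; on a DataStream the loop would never finish, so the function could never return and no checkpoint could be taken.

import org.apache.flink.api.common.functions.MapPartitionFunction
import org.apache.flink.util.Collector

// Hypothetical example: count the records of one (finite) partition.
class CountPartition extends MapPartitionFunction[String, Long] {
  override def mapPartition(values: java.lang.Iterable[String],
                            out: Collector[Long]): Unit = {
    var count = 0L
    val it = values.iterator()
    while (it.hasNext) { // on an unbounded stream, this loop would never end
      it.next()
      count += 1
    }
    out.collect(count)
  }
}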

In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams, which is somewhat similar to mini-batches, but windows offer much more flexibility. Since a window is of finite size, you can call reduce on the window.

This could look like:

yourStream.keyBy("myKey")                    // organize stream by key "myKey"
          .timeWindow(Time.seconds(5))       // build 5 sec tumbling windows
          .reduce(new YourReduceFunction()); // apply a reduce function on each window
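
Here, YourReduceFunction is a placeholder. Assuming the stream carries (String, Int) pairs, a minimal sketch of such a function could look like:

import org.apache.flink.api.common.functions.ReduceFunction

// Sums the Int field of two records that share the same key.
class YourReduceFunction extends ReduceFunction[(String, Int)] {
  override def reduce(a: (String, Int), b: (String, Int)): (String, Int) =
    (a._1, a._2 + b._2)
}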

The DataStream documentation shows how to define various window types and explains all available functions.
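
For example, switching from the tumbling window above to a sliding window only changes the window definition (a sketch, assuming the same API version):

yourStream.keyBy("myKey")
          .timeWindow(Time.seconds(10), Time.seconds(5)) // 10 sec windows sliding every 5 sec
          .reduce(new YourReduceFunction())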

Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the coming days.




Answer 2:


Assuming your input stream contains single-partition data (say, of type String):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

val new_number_of_partitions = 4

// The line below repartitions your data; you could also broadcast data to all partitions.
val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)

// Flexibility for mapping
val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)] {
  // var local_val_to_different_part: SomeType = _
  var myTaskId: Int = _ // initialized in open()

  // open() is executed once per mapper instance (one mapper per partition)
  override def open(config: Configuration): Unit = {
    myTaskId = getRuntimeContext.getIndexOfThisSubtask
    // Do whatever initialization you want here, e.g. read from data sources.
  }

  override def map(value: String): (String, Int) = {
    (value, myTaskId)
  }
})

val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print
// Instead of sum(1), you could use .reduce((x, y) => (x._1, x._2 + y._2)).
// .countWindow first waits for a certain number of records for a particular key
// and then applies the function.

Flink streaming is pure streaming (not micro-batched). Also take a look at the Iterate API.
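
A minimal sketch of a feedback loop with that API (the decrement-and-filter logic is just an illustration, written in this answer's Scala style):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Repeatedly subtract 1; values still above zero are fed back into
// the loop, the rest leave the loop and are emitted downstream.
val numbers: DataStream[Long] = env.fromElements(5L, 10L)
val iterated: DataStream[Long] = numbers.iterate { input =>
  val minusOne = input.map(_ - 1)
  val feedback = minusOne.filter(_ > 0)  // looped back
  val output   = minusOne.filter(_ <= 0) // leaves the loop
  (feedback, output)
}
iterated.print()
env.execute("iterate sketch")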



Source: https://stackoverflow.com/questions/33401332/apache-flink-datastream-api-doesnt-have-a-mappartition-transformation
