Question
I have input lines like the following:
t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2
and I want to produce output like the rows below, which is a vertical (element-wise) addition of the corresponding numbers:
file1 : [ 1+1+5, 1+2+5, 1+3+5 ]
file2 : [ 2+1, 2+1, 2+2, 2+2 ]
I am in a Spark Streaming context and I am having a hard time figuring out how to aggregate by file name.
It seems like I will need to use something like the following, but I am not sure how to get the syntax right. Any inputs will be helpful.
myDStream.foreachRDD(rdd => rdd.groupBy())
or
myDStream.foreachRDD(rdd => rdd.aggregate())
I know how to do the vertical sum of an array of given numbers, but I am not sure how to feed that function to the aggregator.
def compute_counters(counts: ArrayBuffer[List[Int]]) = {
  counts.toList.transpose.map(_.sum)
}
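For example, applied to the three file1 rows from the input above, the transpose trick produces the expected vertical sums:

  compute_counters(ArrayBuffer(List(1, 1, 1), List(1, 2, 3), List(5, 5, 5)))
  // => List(7, 8, 9), i.e. [1+1+5, 1+2+5, 1+3+5]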
Answer 1:
First, you need to extract the relevant key and values from the comma-separated string, parse them, and create a tuple containing the key and the array of integers using InputDStream.map. Then, use PairDStreamFunctions.reduceByKey to apply the sum per key:
dStream
  .map(line => {
    val splitLines = line.split(", ")
    (splitLines(1), splitLines.slice(2, splitLines.length).map(_.toInt))
  })
  .reduceByKey((first, second) => Array(first.sum + second.sum))
  .foreachRDD(rdd => rdd.foreach { case (key, sum) => println(s"Key: $key, sum: ${sum.head}") })
The reduce will yield a tuple of (String, Array[Int]), where the string contains the id (be it "file1" or "file2") and an array with a single value, containing the sum per key.
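If you instead want the element-wise (vertical) sums asked for in the question, a minimal variant of the reduce function (a sketch, assuming every row for a given key has the same number of columns) zips the two arrays and adds position by position:

  .reduceByKey((first, second) => first.zip(second).map { case (a, b) => a + b })

This pairs up the two arrays index by index and sums each pair, which is exactly the vertical addition shown in the expected output.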
Answer 2:
Thanks Yuval, I was able to do it using your approach. Updating with my final working code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.ArrayBuffer

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("HBaseStream")
  val sc = new SparkContext(conf)
  // create a StreamingContext, the main entry point for all streaming functionality
  val ssc = new StreamingContext(sc, Seconds(2))
  val inputStream = ssc.socketTextStream("hostname", 9999)
  val parsedDstream = inputStream
    .map(line => {
      // key by the file name, parse the remaining fields as integers
      val splitLines = line.split(",")
      (splitLines(1), splitLines.slice(2, splitLines.length).map(_.trim.toInt))
    })
    .reduceByKey((first, second) => {
      // vertical addition: transpose the two rows and sum each column
      val listOfArrays = ArrayBuffer(first, second)
      listOfArrays.toList.transpose.map(_.sum).toArray
    })
  parsedDstream.foreachRDD(rdd => rdd.foreach(Blaher.blah))
  ssc.start()             // start the streaming computation
  ssc.awaitTermination()  // wait for it to terminate
}
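Blaher.blah is a helper defined elsewhere; a minimal hypothetical sketch, assuming it just prints each (key, sums) pair, could look like this:

  object Blaher {
    // Hypothetical stand-in for the output helper; the real implementation is not shown here.
    def blah(pair: (String, Array[Int])): Unit = {
      val (key, sums) = pair
      println(s"$key : [${sums.mkString(", ")}]")
    }
  }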
Source: https://stackoverflow.com/questions/35539141/spark-streaming-group-by-custom-function