Question
I have input lines like the following:
t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2
and I want to produce output like the rows below, which is a vertical (element-wise) addition of the corresponding numbers:
file1 : [ 1+1+5, 1+2+5, 1+3+5 ]
file2 : [ 2+1, 2+1, 2+2, 2+2 ]
I am in a Spark Streaming context and I am having a hard time figuring out how to aggregate by file name.
It seems like I will need to use something like the following, but I am not sure how to get the syntax right. Any inputs will be helpful.
myDStream.foreachRDD(rdd => rdd.groupBy())
or
myDStream.foreachRDD(rdd => rdd.aggregate())
I know how to do the vertical sum of an array of given numbers, but I am not sure how to feed that function to the aggregator.
def compute_counters(counts: ArrayBuffer[List[Int]]) = {
  counts.toList.transpose.map(_.sum)
}
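For example, applied to the three file1 rows from the input above, the transpose trick produces the expected vertical sums:

  compute_counters(ArrayBuffer(List(1, 1, 1), List(1, 2, 3), List(5, 5, 5)))
  // => List(7, 8, 9), i.e. [1+1+5, 1+2+5, 1+3+5]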
Answer 1:
First, you need to extract the relevant key and values from the comma-separated string, parse them, and create a tuple containing the key and the array of integers using InputDStream.map. Then, use PairDStreamFunctions.reduceByKey to apply the sum per key:
dStream
  .map(line => {
    val splitLines = line.split(", ")
    (splitLines(1), splitLines.slice(2, splitLines.length).map(_.toInt))
  })
  .reduceByKey((first, second) => Array(first.sum + second.sum))
  .foreachRDD(rdd => rdd.foreach { case (key, sum) => println(s"Key: $key, sum: ${sum.head}") })
The reduce will yield a tuple of (String, Array[Int]), where the string contains the id (be it "file1" or "file2") and an array with a single value, containing the sum per key.
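If you instead want the element-wise (vertical) sums asked for in the question, a minimal variant of the reduce function (a sketch, assuming every row for a given key has the same number of columns) zips the two arrays and adds position by position:

  .reduceByKey((first, second) => first.zip(second).map { case (a, b) => a + b })

This pairs up the two arrays index by index and sums each pair, which is exactly the vertical addition shown in the expected output.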
Answer 2:
Thanks Yuval, I was able to do it using your approach. Updating with my final working code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.ArrayBuffer

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("HBaseStream")
  val sc = new SparkContext(conf)
  // create a StreamingContext, the main entry point for all streaming functionality
  val ssc = new StreamingContext(sc, Seconds(2))
  val inputStream = ssc.socketTextStream("hostname", 9999)
  val parsedDstream = inputStream
    .map(line => {
      // key by the file name, parse the remaining fields as integers
      val splitLines = line.split(",")
      (splitLines(1), splitLines.slice(2, splitLines.length).map(_.trim.toInt))
    })
    .reduceByKey((first, second) => {
      // vertical addition: transpose the two rows and sum each column
      val listOfArrays = ArrayBuffer(first, second)
      listOfArrays.toList.transpose.map(_.sum).toArray
    })
  parsedDstream.foreachRDD(rdd => rdd.foreach(Blaher.blah))
  ssc.start()             // start the streaming computation
  ssc.awaitTermination()  // wait for it to terminate
}
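Blaher.blah is a helper defined elsewhere; a minimal hypothetical sketch, assuming it just prints each (key, sums) pair, could look like this:

  object Blaher {
    // Hypothetical stand-in for the output helper; the real implementation is not shown here.
    def blah(pair: (String, Array[Int])): Unit = {
      val (key, sums) = pair
      println(s"$key : [${sums.mkString(", ")}]")
    }
  }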
Source: https://stackoverflow.com/questions/35539141/spark-streaming-group-by-custom-function