I have a delicate Spark problem, where i just can\'t wrap my head around.
We have two RDDs ( coming from Cassandra ). RDD1 contains Actions
and RDD2 contai
After a few hours of thinking, trying and failing I came up with this solution. I am not sure if it is any good, but due the lack of other options, this is my solution.
First we expand our case class Historic
case class Historic(id: String, set_at: Long, valueY: Int) {
val set_at_map = new java.util.TreeMap[Long, Int]() // as it seems Scala doesn't provides something like this with similar operations we'll need a few lines later
set_at_map.put(0, valueY) // Means from the beginning of Epoch ...
set_at_map.put(set_at, valueY) // .. to the set_at date
// This is the fun part. With .getHistoricValue we can pass any timestamp and we will get the a value of the key back that contains the passed date. For more information look at this answer: http://stackoverflow.com/a/13400317/1209327
def getHistoricValue(date: Long) : Option[Int] = {
var e = set_at_map.floorEntry(date)
if (e != null && e.getValue == null) {
e = set_at_map.lowerEntry(date)
}
if ( e == null ) None else e.getValue()
}
}
The case class is ready and now we bring it into action
val historicRDD = sc.cassandraTable[Historic](...)
.map( row => ( row.id, row ) )
.reduceByKey( (row1, row2) => {
row1.set_at_map.put(row2.set_at, row2.valueY) // we add the historic Events up to each id
row1
})
// Now we load the Actions and map it by id as we did with Historic
val actionsRDD = sc.cassandraTable[Actions](...)
.map( row => ( row.id, row ) )
// Now both RDDs have the same key and we can join them
val fin = actionsRDD.join(historicRDD)
.map( row => {
( row._1.id,
(
row._2._1.id,
row._2._1.valueX - row._2._2.getHistoricValue(row._2._1.time).get // returns valueY for that timestamp
)
)
})
I am totally new to Scala, so please let me know if we could improve this code on some place.