Spark: How to join RDDs by time range

挽巷 2021-02-05 13:27

I have a delicate Spark problem that I just can't wrap my head around.

We have two RDDs ( coming from Cassandra ). RDD1 contains Actions and RDD2 contai

3 Answers
  •  花落未央
    2021-02-05 14:13

    After a few hours of thinking, trying and failing, I came up with this solution. I am not sure if it is any good, but for lack of other options, this is what I went with.

    First we expand our case class Historic:

    case class Historic(id: String, set_at: Long, valueY: Int) {
      // Scala doesn't seem to provide something like this with similar operations, which we'll need a few lines later
      val set_at_map = new java.util.TreeMap[Long, Int]()
      set_at_map.put(0L, valueY)     // Means from the beginning of the epoch ...
      set_at_map.put(set_at, valueY) // ... to the set_at date

      // This is the fun part. Pass any timestamp to getHistoricValue and it returns the value stored under the greatest key that is less than or equal to that timestamp. For more information look at this answer: http://stackoverflow.com/a/13400317/1209327
      def getHistoricValue(date: Long): Option[Int] =
        // floorEntry returns null when no key <= date exists, so wrap it in Option
        Option(set_at_map.floorEntry(date)).map(_.getValue)
    }
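
    To make the lookup concrete, here is a minimal standalone sketch of how `floorEntry` behaves on a `java.util.TreeMap`. The object name `FloorEntryDemo`, the helper `valueAt`, and the sample timestamps and values are made up for illustration; unlike the case class, it does not insert a key at 0, so a lookup before the first entry yields `None`.

    ```scala
    import java.util.TreeMap

    object FloorEntryDemo {
      // Returns the value stored under the greatest key <= t, if there is one.
      def valueAt(history: TreeMap[Long, Int], t: Long): Option[Int] =
        Option(history.floorEntry(t)).map(_.getValue)

      // Hypothetical sample data: the value becomes 7 at t=100 and 9 at t=200.
      val history = new TreeMap[Long, Int]()
      history.put(100L, 7)
      history.put(200L, 9)

      def main(args: Array[String]): Unit = {
        println(valueAt(history, 50L))  // None: nothing was set yet
        println(valueAt(history, 100L)) // Some(7): exact match on a key
        println(valueAt(history, 150L)) // Some(7): latest entry before 150
        println(valueAt(history, 250L)) // Some(9)
      }
    }
    ```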
    

    The case class is ready and now we bring it into action:

    val historicRDD = sc.cassandraTable[Historic](...)
      .map( row => ( row.id, row ) )
      .reduceByKey( (row1, row2) => {
        // we add the historic events up for each id; putAll keeps every entry
        // even when row2 is itself an already-merged partial result
        row1.set_at_map.putAll(row2.set_at_map)
        row1
      })
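
    Since `reduceByKey` may combine partial results in any grouping, the merge function should give the same result however the inputs are paired up. The following is a sketch of such a merge on plain Scala collections, not Spark code; `MergeHistories`, `merge`, and the three sample partial histories are assumptions for illustration, and it swaps the mutable Java map for an immutable `scala.collection.immutable.TreeMap`.

    ```scala
    import scala.collection.immutable.TreeMap

    object MergeHistories {
      type History = TreeMap[Long, Int]

      // Keeps every entry from both sides; on a duplicate timestamp the right side wins.
      def merge(a: History, b: History): History = a ++ b

      // Hypothetical partial results, as if each came from a different partition.
      val p1: History = TreeMap(100L -> 7)
      val p2: History = TreeMap(200L -> 9)
      val p3: History = TreeMap(300L -> 4)

      def main(args: Array[String]): Unit = {
        val leftFirst  = merge(merge(p1, p2), p3)
        val rightFirst = merge(p1, merge(p2, p3))
        println(leftFirst == rightFirst) // true: the grouping does not matter
        println(leftFirst.keys.toList)   // List(100, 200, 300)
      }
    }
    ```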
    
    // Now we load the Actions and map them by id as we did with Historic
    val actionsRDD = sc.cassandraTable[Actions](...)
      .map( row => ( row.id, row ) )
    
    // Now both RDDs have the same key and we can join them
    val fin = actionsRDD.join(historicRDD)
      .map( row => {
        ( row._1, // the join key, i.e. the id
          (
            row._2._1.id,
            row._2._1.valueX - row._2._2.getHistoricValue(row._2._1.time).get // subtracts the valueY valid at that timestamp
          )
        )
      })
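
    The same join-by-time idea can be tried without a cluster. Below is a rough sketch on plain Scala collections, assuming Scala 2.13 or later for `rangeTo`; the `Action` case class, the `joinByTime` helper, and the sample rows are assumptions made up for this example. Unlike the code above, it drops actions that have no earlier Historic row instead of falling back to an entry at 0.

    ```scala
    import scala.collection.immutable.TreeMap

    object TimeRangeJoinSketch {
      case class Action(id: String, time: Long, valueX: Int)
      case class Historic(id: String, set_at: Long, valueY: Int)

      // For each action, pick the latest Historic row of the same id with
      // set_at <= time and emit (id, valueX - valueY).
      def joinByTime(actions: Seq[Action], historic: Seq[Historic]): Seq[(String, Int)] = {
        // One sorted map per id: set_at -> valueY
        val byId: Map[String, TreeMap[Long, Int]] =
          historic.groupBy(_.id).map { case (id, rows) =>
            id -> TreeMap(rows.map(h => h.set_at -> h.valueY): _*)
          }
        actions.flatMap { a =>
          byId.get(a.id)
            .flatMap(_.rangeTo(a.time).lastOption) // greatest set_at <= a.time
            .map { case (_, y) => (a.id, a.valueX - y) }
        }
      }

      // Hypothetical sample data
      val historic = Seq(Historic("a", 100L, 7), Historic("a", 200L, 9))
      val actions = Seq(
        Action("a", 150L, 10), // matches set_at = 100
        Action("a", 250L, 10), // matches set_at = 200
        Action("a", 50L, 10),  // before the first Historic row: dropped
        Action("b", 150L, 10)  // no Historic rows for this id: dropped
      )

      def main(args: Array[String]): Unit =
        println(joinByTime(actions, historic)) // List((a,3), (a,1))
    }
    ```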
    

    I am totally new to Scala, so please let me know if this code could be improved anywhere.
