Spark: How to join RDDs by time range

挽巷 2021-02-05 13:27

I have a delicate Spark problem that I just can't wrap my head around.

We have two RDDs (coming from Cassandra). RDD1 contains Actions and RDD2 contai

3 answers
  •  栀梦
    栀梦 (original poster)
    2021-02-05 14:03

    I know that this question has already been answered, but I want to add another solution that worked for me.

    Your data:

    Actions 
    id  |  time  | valueX
    1   |  12:05 | 500
    1   |  12:30 | 500
    2   |  12:30 | 125
    
    Historic 
    id  |  set_at| valueY
    1   |  11:00 | 400
    1   |  12:15 | 450
    2   |  12:20 | 50
    2   |  12:25 | 75
    
    1. Union Actions and Historic
        Combined
        id  |  time  | value  | record-type
        1   |  12:05 | 500    | Action
        1   |  12:30 | 500    | Action
        2   |  12:30 | 125    | Action
        1   |  11:00 | 400    | Historic
        1   |  12:15 | 450    | Historic
        2   |  12:20 | 50     | Historic
        2   |  12:25 | 75     | Historic
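
    The union step can be sketched in plain Python, without Spark, just to show the shape of the combined records. The `Record` type, the `tag` helper and the `value` field name are assumptions made for this sketch; on real RDDs the same idea would be to map both RDDs to a common shape and then call `union`.

```python
from typing import NamedTuple


class Record(NamedTuple):
    id: int
    time: str         # "HH:MM", zero-padded so string order matches time order
    value: int        # valueX for Action rows, valueY for Historic rows
    record_type: str  # "Action" or "Historic"


# Sample data from the tables above
actions = [(1, "12:05", 500), (1, "12:30", 500), (2, "12:30", 125)]
historic = [(1, "11:00", 400), (1, "12:15", 450), (2, "12:20", 50), (2, "12:25", 75)]


def tag(rows, record_type):
    """Attach a record-type marker so both sources share one schema."""
    return [Record(i, t, v, record_type) for (i, t, v) in rows]


# Plain list concatenation stands in for a union of the two RDDs
combined = tag(actions, "Action") + tag(historic, "Historic")

for rec in combined:
    print(rec)
```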
    
    2. Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time within each partition.

      Partition-1
      1   |  11:00 | 400    | Historic
      1   |  12:05 | 500    | Action
      1   |  12:15 | 450    | Historic
      1   |  12:30 | 500    | Action
      Partition-2
      2   |  12:20 | 50     | Historic
      2   |  12:25 | 75     | Historic
      2   |  12:30 | 125    | Action
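
    To make the partition-by-id, sort-by-time idea concrete, here is a small plain-Python model of what this step produces. It does not call Spark. `partition_for`, `repartition_and_sort` and the choice of two partitions are assumptions for illustration. In Spark the usual way to get this effect is to key each record by (id, time) and give repartitionAndSortWithinPartitions a partitioner that looks only at the id part of the key.

```python
from collections import defaultdict

NUM_PARTITIONS = 2  # assumption: two partitions, as in the example above


def partition_for(record_id, num_partitions=NUM_PARTITIONS):
    """Stand-in for a custom partitioner: only the id picks the partition."""
    return record_id % num_partitions


def repartition_and_sort(records, num_partitions=NUM_PARTITIONS):
    """Group records by partition, then sort each partition by (id, time)."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[partition_for(rec[0], num_partitions)].append(rec)
    return {
        p: sorted(rows, key=lambda r: (r[0], r[1]))
        for p, rows in partitions.items()
    }


# Combined records: (id, time, value, record_type)
combined = [
    (1, "12:05", 500, "Action"),
    (1, "12:30", 500, "Action"),
    (2, "12:30", 125, "Action"),
    (1, "11:00", 400, "Historic"),
    (1, "12:15", 450, "Historic"),
    (2, "12:20", 50, "Historic"),
    (2, "12:25", 75, "Historic"),
]

partitions = repartition_and_sort(combined)
for p in sorted(partitions):
    print("partition index", p)
    for rec in partitions[p]:
        print("  ", rec)
```

    With this modulo partitioner, id 1 lands in partition index 1 and id 2 in partition index 0; the numbering differs from the labels above, but the content and order of each partition are the same.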
      

    3. Traverse through the records per partition.

    If it is a Historic record, add it to a map, or update the entry if the map already has that id. This keeps track of the latest valueY per id, using one map per partition.

    If it is an Action record, get the latest valueY for its id from the map and subtract it from valueX.

    Using a map M:

    Partition-1 traversal in order
    M={ 1 -> 400}   // A new entry in map M
    1 | 100         // M(1) = 400; 500-400
    M={ 1 -> 450}   // update M, because key already exists
    1 | 50          // M(1) = 450; 500-450
    Partition-2 traversal in order
    M={ 2 -> 50}    // A new entry in M
    M={ 2 -> 75}    // update M, because key already exists
    2 | 50          // M(2) = 75; 125-75
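
    The traversal can be written as a small generator over one sorted partition. This is a sketch in plain Python; the function name `subtract_latest_historic` and the tuple layout (id, time, value, record_type) are assumptions. Because it takes an iterator and yields results, a function of this shape could be handed to something like mapPartitions. Skipping Actions that have no earlier Historic row is also an assumption, since that case is not covered above.

```python
def subtract_latest_historic(sorted_records):
    """Walk one partition (already sorted by id, then time) and emit
    (id, valueX - latest valueY) for every Action record."""
    latest = {}  # the map M: id -> most recent valueY seen so far
    for rec_id, _time, value, record_type in sorted_records:
        if record_type == "Historic":
            # New entry, or overwrite the older valueY for this id
            latest[rec_id] = value
        elif rec_id in latest:
            yield (rec_id, value - latest[rec_id])
        # Assumption: an Action with no earlier Historic row is skipped


partition_1 = [
    (1, "11:00", 400, "Historic"),
    (1, "12:05", 500, "Action"),
    (1, "12:15", 450, "Historic"),
    (1, "12:30", 500, "Action"),
]
partition_2 = [
    (2, "12:20", 50, "Historic"),
    (2, "12:25", 75, "Historic"),
    (2, "12:30", 125, "Action"),
]

result_1 = list(subtract_latest_historic(partition_1))
result_2 = list(subtract_latest_historic(partition_2))
print(result_1)  # [(1, 100), (1, 50)]
print(result_2)  # [(2, 50)]
```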

    You could instead partition and sort by time, but then you would need to merge the partitions later, which adds complexity.

    I found this approach preferable to the many-to-many join that we usually get when joining on time ranges.
