Spark: How to join RDDs by time range

挽巷 2021-02-05 13:27

I have a delicate Spark problem that I just can't wrap my head around.

We have two RDDs ( coming from Cassandra ). RDD1 contains Actions and RDD2 contai

3 Answers

栀梦 (original poster)

2021-02-05 14:03
I know this question has already been answered, but I want to add another solution that worked for me.

Your data:
```
Actions 
id  |  time  | valueX
1   |  12:05 | 500
1   |  12:30 | 500
2   |  12:30 | 125

Historic 
id  |  set_at| valueY
1   |  11:00 | 400
1   |  12:15 | 450
2   |  12:20 | 50
2   |  12:25 | 75
```
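1. Union Actions and Historic, tagging each row with its record type. The value column holds valueX for Action rows and valueY for Historic rows.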
```
    Combined
    id  |  time  | value  | record-type
    1   |  12:05 | 500    | Action
    1   |  12:30 | 500    | Action
    2   |  12:30 | 125    | Action
    1   |  11:00 | 400    | Historic
    1   |  12:15 | 450    | Historic
    2   |  12:20 | 50     | Historic
    2   |  12:25 | 75     | Historic
```
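
The union step can be sketched in plain Python on the sample rows above. This is only an illustration: the `Record` type and the `tag` helper are hypothetical names, and list concatenation stands in for a union of the two RDDs after each has been mapped to a common shape.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    id: int
    time: str         # "HH:MM", zero-padded so string order matches time order
    value: int        # valueX for Action rows, valueY for Historic rows
    record_type: str  # "Action" or "Historic"


# Sample rows from the tables above: (id, time, value)
actions = [(1, "12:05", 500), (1, "12:30", 500), (2, "12:30", 125)]
historic = [(1, "11:00", 400), (1, "12:15", 450), (2, "12:20", 50), (2, "12:25", 75)]


def tag(rows: List[Tuple[int, str, int]], record_type: str) -> List[Record]:
    """Attach the record type so both inputs share one schema."""
    return [Record(i, t, v, record_type) for (i, t, v) in rows]


# Plain list concatenation stands in for a union of the two tagged RDDs.
combined = tag(actions, "Action") + tag(historic, "Historic")

for rec in combined:
    print(rec)
```

Giving both inputs the same shape before the union is what lets a single sort interleave Action and Historic rows later.
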
2. Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time.
```
Partition-1
1   |  11:00 | 400    | Historic
1   |  12:05 | 500    | Action
1   |  12:15 | 450    | Historic
1   |  12:30 | 500    | Action
Partition-2
2   |  12:20 | 50     | Historic
2   |  12:25 | 75     | Historic
2   |  12:30 | 125    | Action
```
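
To make the partition-by-id, sort-by-time idea concrete, here is a local simulation in plain Python, not Spark itself. It assumes a composite key `(id, time)`, two partitions, and a partitioning rule of `id % num_partitions`; the names `partition_for` and `repartition_and_sort` are hypothetical. Partition indexes are zero-based here, so they will not match the "Partition-1" and "Partition-2" labels above.

```python
from collections import defaultdict

# Combined rows from step 1: (id, time, value, record_type)
combined = [
    (1, "12:05", 500, "Action"),
    (1, "12:30", 500, "Action"),
    (2, "12:30", 125, "Action"),
    (1, "11:00", 400, "Historic"),
    (1, "12:15", 450, "Historic"),
    (2, "12:20", 50, "Historic"),
    (2, "12:25", 75, "Historic"),
]

NUM_PARTITIONS = 2  # assumption for this example


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Route a composite (id, time) key by id only, ignoring time."""
    record_id, _time = key
    return record_id % num_partitions


def repartition_and_sort(records, num_partitions=NUM_PARTITIONS):
    """Group rows by partition, then sort each partition by the full key."""
    partitions = defaultdict(list)
    for rec in records:
        key = (rec[0], rec[1])  # (id, time)
        partitions[partition_for(key, num_partitions)].append((key, rec))
    return {
        p: [rec for _key, rec in sorted(rows, key=lambda kv: kv[0])]
        for p, rows in partitions.items()
    }


for index, rows in sorted(repartition_and_sort(combined).items()):
    print("partition", index)
    for row in rows:
        print("   ", row)
```

In PySpark, the same shape would roughly be a pair RDD keyed by `(id, time)` passed to `repartitionAndSortWithinPartitions(numPartitions=2, partitionFunc=lambda key: key[0])`. Treat that call as a sketch to check against your Spark version rather than a tested recipe.
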
3. Traverse the records of each partition in order.
If it is a Historic record, add it to a map, or update the entry if the map already has that id. This keeps track of the latest valueY per id, using one map per partition.

If it is an Action record, get the latest valueY for its id from the map and subtract it from valueX.

Using a map M:

    Partition-1 traversal in order
    M={ 1 -> 400} // A new entry in map M
    1 | 100 // M(1) = 400; 500-400
    M={ 1 -> 450} // update M, because key already exists
    1 | 50 // M(1) = 450; 500-450
    Partition-2 traversal in order
    M={ 2 -> 50} // A new entry in M
    M={ 2 -> 75} // update M, because key already exists
    2 | 50 // M(2) = 75; 125-75
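
The traversal itself can be sketched as a generator over one sorted partition, which is the shape a function passed to `mapPartitions` would take. The name `resolve_actions` is hypothetical, and skipping an Action that has no earlier Historic record is an assumption the answer does not address.

```python
def resolve_actions(sorted_partition):
    """Yield (id, time, valueX - latest valueY) for each Action row.

    Expects rows of (id, time, value, record_type), already sorted by
    (id, time) within the partition.
    """
    latest = {}  # id -> most recent valueY seen so far in this partition
    for record_id, time, value, record_type in sorted_partition:
        if record_type == "Historic":
            # Insert a new entry or overwrite the existing one.
            latest[record_id] = value
        elif record_id in latest:
            yield (record_id, time, value - latest[record_id])
        # Assumption: an Action with no earlier Historic row is skipped.


partition_1 = [
    (1, "11:00", 400, "Historic"),
    (1, "12:05", 500, "Action"),
    (1, "12:15", 450, "Historic"),
    (1, "12:30", 500, "Action"),
]
partition_2 = [
    (2, "12:20", 50, "Historic"),
    (2, "12:25", 75, "Historic"),
    (2, "12:30", 125, "Action"),
]

print(list(resolve_actions(partition_1)))  # [(1, '12:05', 100), (1, '12:30', 50)]
print(list(resolve_actions(partition_2)))  # [(2, '12:30', 50)]
```

Each partition is read once and the map holds at most one entry per id, so no Action row is ever paired with more than one Historic row.
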
You could instead try to partition and sort by time, but then you would need to merge the partitions later, which adds some complexity.

I found this preferable to the many-to-many join that we usually get when joining on time ranges.