Skewed dataset join in Spark?

不要未来只要你来 2020-12-01 01:28

I am joining two big datasets using Spark RDDs. One dataset is heavily skewed, so a few of the executor tasks take a long time to finish the job. How can I solve this scenar

5 answers

执念已碎 (original poster)

2020-12-01 02:26
Say you have to join two tables A and B on A.id = B.id. Let's assume that table A is skewed on id = 1.

i.e. select A.id from A join B on A.id = B.id

There are two basic approaches to solve the skew join issue:

Approach 1:

Break your query/dataset into two parts: one containing only the skewed data and the other containing the non-skewed data. In the example above, the query becomes:
```
 1. select A.id from A join B on A.id = B.id where A.id <> 1;
 2. select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
```
The first query will not have any skew, so all the tasks of ResultStage will finish at roughly the same time.

If we assume that B has only a few rows with B.id = 1, then those rows will fit into memory, so the second query will be converted to a broadcast join. This is also called a map-side join in Hive.

Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

The partial results of the two queries can then be merged to get the final results.
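
The logic of Approach 1 can be sketched in plain Python. This is only a model of the idea, not Spark code: the tables are lists of `(id, value)` tuples, and the helper names `hash_join` and `split_skew_join` are hypothetical. The dictionary lookup stands in for a regular shuffle join, and the small in-memory list `b_hot` stands in for the broadcast side.

```python
from collections import defaultdict


def hash_join(left, right):
    """Plain equi-join on the first tuple field (stand-in for a shuffle join)."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


def split_skew_join(a_rows, b_rows, skewed_key):
    # Part 1: everything except the skewed key, so no single key dominates.
    a_rest = [row for row in a_rows if row[0] != skewed_key]
    b_rest = [row for row in b_rows if row[0] != skewed_key]
    part1 = hash_join(a_rest, b_rest)

    # Part 2: only the skewed key. B's matching rows are assumed to be few
    # enough to hold in memory, which is what a broadcast join relies on.
    b_hot = [value for key, value in b_rows if key == skewed_key]
    part2 = [(key, lv, rv)
             for key, lv in a_rows if key == skewed_key
             for rv in b_hot]

    # Merge the two partial results into the final result.
    return part1 + part2


# Table A is skewed on id = 1; table B has one row per id.
a_rows = [(1, f"a{i}") for i in range(6)] + [(2, "a6"), (3, "a7")]
b_rows = [(1, "b1"), (2, "b2"), (3, "b3"), (4, "b4")]
result = split_skew_join(a_rows, b_rows, skewed_key=1)
```

The merged result contains the same rows as joining A and B directly; only the way the work is divided changes.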

Approach 2:

As LeMuBei also mentioned above, the second approach randomizes the join key by appending an extra column. Steps:
1. Add a column to the larger table (A), say skewLeft, and populate it with a random number between 0 and N-1 for every row.
2. Add a column to the smaller table (B), say skewRight, and replicate the smaller table N times, so that the new skewRight column takes each value from 0 to N-1 once per original row. You can use the explode SQL/Dataset operator for this.
After steps 1 and 2, join the two datasets/tables with the join condition updated to:
```
A.id = B.id && A.skewLeft = B.skewRight
```
Reference: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
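
A minimal plain-Python sketch of Approach 2 follows. Again, this models the technique rather than using the Spark API: the function name `salted_join`, the `seed` parameter, and the `load` counter are hypothetical additions for illustration. The random salt plays the role of skewLeft, and the replicated salt plays the role of skewRight.

```python
import random
from collections import Counter, defaultdict


def salted_join(a_rows, b_rows, n, seed=0):
    rng = random.Random(seed)

    # Step 1: give every row of the larger table a random salt in [0, n-1].
    a_salted = [((key, rng.randrange(n)), value) for key, value in a_rows]

    # Step 2: replicate every row of the smaller table once per salt value.
    b_salted = [((key, salt), value)
                for key, value in b_rows
                for salt in range(n)]

    # Join on the composite key (id, salt) instead of id alone.
    index = defaultdict(list)
    for composite_key, value in b_salted:
        index[composite_key].append(value)
    joined = [(composite_key[0], lv, rv)
              for composite_key, lv in a_salted
              for rv in index.get(composite_key, [])]

    # Number of rows of A that fall under each composite key.
    load = Counter(composite_key for composite_key, _ in a_salted)
    return joined, load


# 1000 rows of A share id = 1; with n = 4 they are spread over 4 composite keys.
a_rows = [(1, i) for i in range(1000)] + [(2, 1000)]
b_rows = [(1, "x"), (2, "y")]
joined, load = salted_join(a_rows, b_rows, n=4)
```

Each row of A still matches exactly one copy of its B row, so the join result is unchanged, while the rows for the hot key are divided among N composite keys. The cost is that the smaller table grows N times, so N is a trade-off between balance and replication.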