I am joining two big datasets using Spark RDD. One dataset is very much skewed so few of the executor tasks taking a long time to finish the job. How can I solve this scenar
Say you have to join two tables A and B on A.id=B.id. Lets assume that table A has skew on id=1.
i.e. select A.id from A join B on A.id = B.id
There are two basic approaches to solve the skew join issue:
Break your query/dataset into 2 parts - one containing only skew and the other containing non skewed data. In the above example. query will become -
1. select A.id from A join B on A.id = B.id where A.id <> 1;
2. select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
The first query will not have any skew, so all the tasks of ResultStage will finish at roughly the same time.
If we assume that B has only few rows with B.id = 1, then it will fit into memory. So Second query will be converted to a broadcast join. This is also called Map-side join in Hive.
Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
The partial results of the two queries can then be merged to get the final results.
Also mentioned by LeMuBei above, the 2nd approach tries to randomize the join key by appending extra column. Steps:
Add a column in the larger table (A), say skewLeft and populate it with random numbers between 0 to N-1 for all the rows.
Add a column in the smaller table (B), say skewRight. Replicate the smaller table N times. So values in new skewRight column will vary from 0 to N-1 for each copy of original data. For this, you can use the explode sql/dataset operator.
After 1 and 2, join the 2 datasets/tables with join condition updated to-
*A.id = B.id && A.skewLeft = B.skewRight*
Reference: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/