Skewed dataset join in Spark?

I am joining two big datasets using Spark RDD. One dataset is heavily skewed, so a few of the executor tasks take a long time to finish the job. How can I solve this scenario?

5 Answers
  •  执念已碎
    2020-12-01 02:26

    Say you have to join two tables A and B on A.id = B.id. Let's assume that table A has skew on id = 1.

    i.e. select A.id from A join B on A.id = B.id

    There are two basic approaches to solving the skewed-join issue:

    Approach 1:

    Break your query/dataset into two parts: one containing only the skewed data and the other containing the non-skewed data. In the above example, the query becomes:

     1. select A.id from A join B on A.id = B.id where A.id <> 1;
     2. select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
    

    The first query will not have any skew, so all the tasks of the ResultStage will finish at roughly the same time.

    If we assume that B has only a few rows with B.id = 1, then they will fit into memory, so the second query will be converted to a broadcast join. This is also called a map-side join in Hive.

    Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

    The partial results of the two queries can then be merged to get the final results.
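
    A minimal sketch of approach 1 with the DataFrame API (the same idea as the two queries above); dfA, dfB and their toy columns are illustrative, not from the original post:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.broadcast

        val spark = SparkSession.builder().appName("skew-split").getOrCreate()
        import spark.implicits._

        // Toy stand-ins for tables A and B; id = 1 is the hot key.
        val dfA = Seq((1, "a1"), (1, "a2"), (2, "a3")).toDF("id", "valA")
        val dfB = Seq((1, "b1"), (2, "b2")).toDF("id", "valB")

        // Query 1: everything except the skewed key joins normally.
        val nonSkewed = dfA.filter($"id" =!= 1).join(dfB.filter($"id" =!= 1), "id")

        // Query 2: the hot key alone; the matching slice of B is small,
        // so a broadcast hint turns it into a map-side join.
        val skewed = dfA.filter($"id" === 1)
          .join(broadcast(dfB.filter($"id" === 1)), "id")

        // Merge the partial results.
        val result = nonSkewed.union(skewed)
        result.show()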

    Approach 2:

    As also mentioned by LeMuBei above, the second approach randomizes the join key by appending an extra column. Steps:

    1. Add a column to the larger table (A), say skewLeft, and populate it with random numbers between 0 and N-1 for all rows.

    2. Add a column to the smaller table (B), say skewRight, and replicate the smaller table N times, so the values in the new skewRight column range from 0 to N-1 across the copies of the original data. For this, you can use the explode SQL/Dataset operator.

    After steps 1 and 2, join the two datasets/tables with the join condition updated to:

                    *A.id = B.id && A.skewLeft = B.skewRight*
    

    Reference: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
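
    A minimal sketch of approach 2 in the same style; dfA, dfB, the column names and N = 8 are illustrative assumptions:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{array, col, explode, floor, lit, rand}

        val spark = SparkSession.builder().appName("skew-salt").getOrCreate()
        import spark.implicits._

        val N = 8 // number of salt buckets; tune for your cluster

        val dfA = Seq((1, "a1"), (1, "a2"), (2, "a3")).toDF("id", "valA") // larger, skewed table
        val dfB = Seq((1, "b1"), (2, "b2")).toDF("id", "valB")            // smaller table

        // Step 1: salt the larger table with a random value in 0..N-1.
        val saltedA = dfA.withColumn("skewLeft", floor(rand() * N).cast("int"))

        // Step 2: replicate the smaller table N times via explode,
        // one copy per salt value 0..N-1.
        val saltedB = dfB.withColumn("skewRight", explode(array((0 until N).map(i => lit(i)): _*)))

        // Join on the original key plus the salt, which spreads the hot
        // key across up to N tasks instead of one.
        val result = saltedA.join(
          saltedB,
          saltedA("id") === saltedB("id") && col("skewLeft") === col("skewRight")
        )
        result.show()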
