Skewed dataset join in Spark?


I am joining two big datasets using Spark RDDs. One dataset is heavily skewed, so a few of the executor tasks take a long time to finish. How can I solve this scenario?

5 answers
  •  不知归路
    2020-12-01 02:16

    Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

    Short version:

    • Add a random salt (an integer in a fixed range, say 0 to N-1) to each key of the large, skewed RDD and use the pair as the new join key
    • Replicate each record of the small RDD N times using explode/flatMap, once per salt value, and build the same composite join key
    • Join the RDDs on the new join key, which spreads each hot key across up to N partitions instead of one
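
The three steps above could be sketched in PySpark roughly as follows. This is only an illustration, not the linked article's code: the helper names (`salt_large`, `replicate_small`, `unsalt`, `salted_join`) and the `NUM_SALTS` value are made up for the example, and it assumes both inputs are pair RDDs of `(key, value)` tuples. It relies only on the standard `RDD.map`, `RDD.flatMap` and `RDD.join` methods.

```python
import random

# Assumed replication factor (hypothetical value); tune it to the degree of skew.
NUM_SALTS = 8


def salt_large(record, num_salts=NUM_SALTS):
    """Attach a random salt in [0, num_salts) to the key of a large-side record."""
    key, value = record
    return ((key, random.randrange(num_salts)), value)


def replicate_small(record, num_salts=NUM_SALTS):
    """Emit one copy of a small-side record per salt value,
    so every salted key on the large side can find its match."""
    key, value = record
    return [((key, salt), value) for salt in range(num_salts)]


def unsalt(record):
    """Drop the salt after the join, restoring the original key."""
    (key, _salt), joined = record
    return (key, joined)


def salted_join(large_rdd, small_rdd, num_salts=NUM_SALTS):
    """Inner-join two pair RDDs on a salted key.

    Yields the same (key, (large_value, small_value)) pairs as
    large_rdd.join(small_rdd), but each hot key is spread over
    up to num_salts distinct join keys.
    """
    salted_large = large_rdd.map(lambda r: salt_large(r, num_salts))
    replicated_small = small_rdd.flatMap(lambda r: replicate_small(r, num_salts))
    return salted_large.join(replicated_small).map(unsalt)
```

With an existing `SparkContext` named `sc`, a call might look like `salted_join(sc.parallelize(big_pairs), sc.parallelize(small_pairs), 16).collect()`. The trade-off is that the small side grows by a factor of `num_salts`, so the factor should stay as low as the skew allows.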
