Skewed dataset join in Spark?

I am joining two big datasets using Spark RDD. One dataset is very skewed, so a few of the executor tasks are taking a long time to finish the job. How can I solve this scenario?

5 Answers
  • 2020-12-01 02:04

    You could try to repartition the "skewed" RDD into more partitions, or try to increase spark.sql.shuffle.partitions (which defaults to 200).

    In your case, I would try to set the number of partitions to be much higher than the number of executors.
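
    For example, a minimal sketch of both suggestions (the RDD/DataFrame names here are hypothetical):

    # Spread the skewed RDD over many more partitions than executors
    repartitioned = skewed_rdd.repartition(2000)

    # For DataFrame/SQL joins, raise the shuffle partition count (default 200)
    spark.conf.set("spark.sql.shuffle.partitions", "2000")
    result = left_df.join(right_df, "id")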

  • 2020-12-01 02:06

    Taking reference from https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/, below is code for fighting the skew in Spark using the PySpark DataFrame API.

    Creating the two DataFrames:

    from math import exp
    from random import randint
    from datetime import datetime
    
    # Helper to inspect how many elements each partition holds (not used in the join itself)
    def count_elements(splitIndex, iterator):
        n = sum(1 for _ in iterator)
        yield (splitIndex, n)
    
    def get_part_index(splitIndex, iterator):
        for it in iterator:
            yield (splitIndex, it)
    
    num_parts = 18
    # create the large skewed rdd
    skewed_large_rdd = sc.parallelize(range(0,num_parts), num_parts).flatMap(lambda x: range(0, int(exp(x))))
    skewed_large_rdd = skewed_large_rdd.mapPartitionsWithIndex(lambda ind, x: get_part_index(ind, x))
    
    skewed_large_df = spark.createDataFrame(skewed_large_rdd,['x','y'])
    
    small_rdd = sc.parallelize(range(0,num_parts), num_parts).map(lambda x: (x, x))
    
    small_df = spark.createDataFrame(small_rdd,['a','b'])
    

    Dividing the large DataFrame into 100 salt bins and replicating the small DataFrame 100 times:

    salt_bins = 100
    from pyspark.sql import functions as F
    
    skewed_transformed_df = skewed_large_df.withColumn('salt', (F.rand()*salt_bins).cast('int')).cache()
    
    small_transformed_df = small_df.withColumn('replicate', F.array([F.lit(i) for i in range(salt_bins)]))
    
    small_transformed_df = small_transformed_df.select('*', F.explode('replicate').alias('salt')).drop('replicate').cache()
    

    Finally, the join that avoids the skew:

    t0 = datetime.now()
    result2 = skewed_transformed_df.join(small_transformed_df, (skewed_transformed_df['x'] == small_transformed_df['a']) & (skewed_transformed_df['salt'] == small_transformed_df['salt']) )
    result2.count() 
    print "The direct join takes %s"%(str(datetime.now() - t0))
    
  • 2020-12-01 02:16

    Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

    Short version:

    • Add a random element to the large RDD and create a new join key with it
    • Add a random element to the small RDD using explode/flatMap to increase the number of entries and create a new join key
    • Join the RDDs on the new join key, which will now be distributed better due to the random seeding (see the sketch below this list)
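
    A minimal RDD-level sketch of that recipe (the RDD names and the salt factor N are assumptions, not from the article):

    import random

    N = 100  # number of salt values

    # large_rdd and small_rdd are assumed to be pair RDDs of (key, value)
    # Salt the large, skewed side: append a random component to every key
    salted_large = large_rdd.map(lambda kv: ((kv[0], random.randint(0, N - 1)), kv[1]))

    # Replicate the small side N times, once per possible salt value
    salted_small = small_rdd.flatMap(lambda kv: [((kv[0], i), kv[1]) for i in range(N)])

    # Join on the composite (key, salt) key, then strip the salt afterwards
    joined = salted_large.join(salted_small).map(lambda kv: (kv[0][0], kv[1]))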
  • 2020-12-01 02:18

    Depending on the particular kind of skew you're experiencing, there may be different ways to solve it. The basic idea is:

    • Modify your join column, or create a new join column, that is not skewed but which still retains adequate information to do the join
    • Do the join on that non-skewed column -- resulting partitions will not be skewed
    • Following the join, you can update the join column back to your preferred format, or drop it if you created a new column

    The "Fighting the Skew In Spark" article referenced in LiMuBei's answer is a good technique if the skewed data participates in the join. In my case, skew was caused by a very large number of null values in the join column. The null values were not participating in the join, but since Spark partitions on the join column, the post-join partitions were very skewed as there was one gigantic partition containing all of the nulls.

    I solved it by adding a new column which changed all null values to a well-distributed temporary value, such as "NULL_SWID_X", where X is a random number between, say, 0 and 10,000, e.g. (in Java):

    // Before the join, create a join column with well-distributed temporary values for null swids.  This column
    // will be dropped after the join.  We need to do this so the post-join partitions will be well-distributed,
    // and not have a giant partition with all null swids.
    String swidWithDistributedNulls = "swid_with_distributed_nulls";
    int numNullValues = 10000; // Just use a number that will always be bigger than number of partitions
    Column swidWithDistributedNullsCol =
        functions.when(csDataset.col(CS_COL_SWID).isNull(), functions.concat(
            functions.lit("NULL_SWID_"),
            functions.round(functions.rand().multiply(numNullValues)))
        )
        .otherwise(csDataset.col(CS_COL_SWID));
    csDataset = csDataset.withColumn(swidWithDistributedNulls, swidWithDistributedNullsCol);
    

    Then join on this new column, and after the join, drop it:

    // drop() returns a new Dataset, so reassign; drop by the name of the column added above
    outputDataset = outputDataset.drop(swidWithDistributedNulls);
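
    A rough PySpark equivalent of the same null-spreading idea (the DataFrame and column names here are hypothetical, not from the original code):

    from pyspark.sql import functions as F

    num_null_values = 10000  # just a number bigger than the number of partitions

    # Replace null join keys with well-distributed placeholder values
    cs_df = cs_df.withColumn(
        "swid_with_distributed_nulls",
        F.when(
            F.col("swid").isNull(),
            F.concat(F.lit("NULL_SWID_"),
                     F.round(F.rand() * num_null_values).cast("int").cast("string"))
        ).otherwise(F.col("swid"))
    )

    # Join on the temporary column, then drop it once the join is done
    output_df = (cs_df
                 .join(other_df, cs_df["swid_with_distributed_nulls"] == other_df["swid"])
                 .drop("swid_with_distributed_nulls"))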
    
  • 2020-12-01 02:26

    Say you have to join two tables A and B on A.id = B.id. Let's assume that table A has skew on id = 1.

    i.e. select A.id from A join B on A.id = B.id

    There are two basic approaches to solve the skew join issue:

    Approach 1:

    Break your query/dataset into two parts: one containing only the skewed data and the other containing the non-skewed data. In the above example, the query becomes:

     1. select A.id from A join B on A.id = B.id where A.id <> 1;
     2. select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
    

    The first query will not have any skew, so all the tasks of ResultStage will finish at roughly the same time.

    If we assume that B has only a few rows with B.id = 1, they will fit into memory, so the second query will be converted to a broadcast join. This is also called a map-side join in Hive.

    Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

    The partial results of the two queries can then be merged to get the final results.
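
    A minimal PySpark sketch of this split, broadcast, and union pattern (A, B and the id column are assumptions matching the example above):

    from pyspark.sql import functions as F
    from pyspark.sql.functions import broadcast

    # Part 1: everything except the hot key joins normally
    non_skewed = A.filter(F.col("id") != 1).join(B, "id")

    # Part 2: the hot key's slice of B is small, so broadcast it (map-side join)
    skewed = A.filter(F.col("id") == 1).join(broadcast(B.filter(F.col("id") == 1)), "id")

    # Merge the two partial results
    result = non_skewed.union(skewed)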

    Approach 2:

    As also mentioned by LiMuBei above, the second approach randomizes the join key by appending an extra column. Steps:

    1. Add a column to the larger table (A), say skewLeft, and populate it with random numbers between 0 and N-1 for all rows.

    2. Add a column to the smaller table (B), say skewRight, and replicate the smaller table N times, so the values in the new skewRight column range from 0 to N-1 for each copy of the original data. For this, you can use the explode SQL/Dataset operator.

    After steps 1 and 2, join the two datasets/tables with the join condition updated to:

        A.id = B.id && A.skewLeft = B.skewRight
    

    Reference: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
