Spark: Order of column arguments in repartition vs partitionBy

时光说笑 2021-01-02 05:40

Methods taken into consideration (Spark 2.2.1):

  1. DataFrame.repartition (the two implementations that take partitionExprs: Column* arguments)
  2. DataFrameWriter.partitionBy
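
For reference, the call shapes under comparison look roughly like this; the output path is only a placeholder:

    df.repartition("x", "y")                            # partitionExprs only
    df.repartition(4, "x", "y")                         # numPartitions + partitionExprs
    df.write.partitionBy("x", "y").parquet("/tmp/out")  # DataFrameWriter.partitionBy
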
2 Answers
  •  悲&欢浪女
    2021-01-02 06:00

    The only similarity between these two methods is their names. They are used for different things and have different mechanics, so you shouldn't compare them at all.

    That being said, repartition shuffles data as follows (the three call shapes are sketched right after this list):

    • With partitionExprs only, it applies a hash partitioner to the columns used in the expressions, with the number of partitions taken from spark.sql.shuffle.partitions.
    • With both partitionExprs and numPartitions, it does the same as the previous one, but overrides spark.sql.shuffle.partitions with the given value.
    • With numPartitions only, it just rearranges data using RoundRobinPartitioning.
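
    A minimal sketch of the three variants, assuming a running SparkSession named spark; the DataFrame demo and the variable names are throwaway stand-ins (the answer builds its own df just below):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    demo = spark.range(10).selectExpr("id as x", "id % 3 as y")

    # partitionExprs only: hash partitioner on (x, y), partition count
    # taken from spark.sql.shuffle.partitions (200 by default)
    by_exprs = demo.repartition("x", "y")

    # numPartitions + partitionExprs: same hash partitioner, but the
    # explicit count overrides spark.sql.shuffle.partitions
    by_exprs_n = demo.repartition(4, "x", "y")

    # numPartitions only: rows are redistributed round-robin
    round_robin = demo.repartition(4)

    print(by_exprs.rdd.getNumPartitions())    # value of spark.sql.shuffle.partitions
    print(by_exprs_n.rdd.getNumPartitions())  # 4
    print(round_robin.rdd.getNumPartitions()) # 4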

    Is the order of column inputs relevant in the repartition method too?

    It is. hash((x, y)) is in general not the same as hash((y, x)).

    df = (spark.range(5, numPartitions=4).toDF("x")
        .selectExpr("cast(x as string)")
        .crossJoin(spark.range(5, numPartitions=4).toDF("y")))
    
    df.repartition(4, "y", "x").rdd.glom().map(len).collect()
    
    [8, 6, 9, 2]
    
    df.repartition(4, "x", "y").rdd.glom().map(len).collect()
    
    [6, 4, 3, 12]
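
    To see the mechanics directly: a hash-partitioned Dataset places each row in pmod(hash(partitioning expressions), numPartitions), and that value should line up with what spark_partition_id() reports. A sketch reusing the df built in the example above (pid and expected_pid are just illustrative names):

    from pyspark.sql import functions as F

    (df.repartition(4, "x", "y")
       .select(
           "x", "y",
           F.spark_partition_id().alias("pid"),
           # Murmur3 hash of the partitioning columns modulo the partition
           # count -- the formula hash partitioning uses; should equal pid
           F.expr("pmod(hash(x, y), 4)").alias("expected_pid"))
       .show(5))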
    

    Does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a SQL query with GROUP BY on the same columns?

    It depends on what exactly is being asked (both points are illustrated in the sketch after this list):

    • Yes. GROUP BY with the same set of columns will result in the same logical distribution of keys over partitions.
    • No. A hash partitioner can map multiple keys to the same partition, whereas GROUP BY "sees" only the actual groups.
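
    A sketch of both points against the df from the example above; pid, n_partitions and n_keys are just illustrative names:

    from pyspark.sql import functions as F

    parts = df.repartition(4, "x", "y").withColumn("pid", F.spark_partition_id())

    # "Yes": every (x, y) group lands wholly in one partition, so no key
    # should span more than a single partition id (expect the max to be 1)
    (parts.groupBy("x", "y")
          .agg(F.countDistinct("pid").alias("n_partitions"))
          .agg(F.max("n_partitions"))
          .show())

    # "No": a single partition usually holds several distinct keys, because
    # the hash partitioner maps many keys to the same partition
    (parts.groupBy("pid")
          .agg(F.countDistinct("x", "y").alias("n_keys"))
          .show())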

    Related: How to define partitioning of DataFrame?
