Methods taken into consideration (Spark 2.2.1):
DataFrame.repartition (the two implementations that take partitionExprs: Column*)
The only similarity between these two methods is their names. They are used for different things and have different mechanics, so you shouldn't compare them at all.
That being said, repartition shuffles data using:

- partitionExprs alone: it uses a hash partitioner on the columns used in the expression, with spark.sql.shuffle.partitions as the number of partitions.
- partitionExprs and numPartitions: it does the same as the previous one, but overrides spark.sql.shuffle.partitions.
- numPartitions alone: it just rearranges data using RoundRobinPartitioning.
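The two strategies can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: the function names are hypothetical, and Python's built-in hash() stands in for Spark's Murmur3-based column hash.

```python
# Sketch of the two repartition strategies (illustrative only).

def hash_partition(rows, key, num_partitions):
    """Route each row to a partition by hashing its key columns,
    like repartition(partitionExprs): equal keys land together."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts

def round_robin_partition(rows, num_partitions):
    """Spread rows evenly regardless of their values,
    like repartition(numPartitions) with RoundRobinPartitioning."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
by_key = hash_partition(rows, key=lambda r: r[0], num_partitions=2)
evenly = round_robin_partition(rows, num_partitions=2)
```

Note the trade-off the sketch makes visible: hash partitioning co-locates equal keys but can produce skewed partition sizes, while round-robin balances sizes but scatters keys.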
Is the order of column inputs relevant in the repartition method too?
It is. hash((x, y)) is in general not the same as hash((y, x)).
df = (spark.range(5, numPartitions=4).toDF("x")
    .selectExpr("cast(x as string)")
    .crossJoin(spark.range(5, numPartitions=4).toDF("y")))

df.repartition(4, "y", "x").rdd.glom().map(len).collect()
# [8, 6, 9, 2]

df.repartition(4, "x", "y").rdd.glom().map(len).collect()
# [6, 4, 3, 12]
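The same order sensitivity can be reproduced without Spark. The sketch below uses Python's built-in hash() as a stand-in for Spark's Murmur3-based column hash (an assumption for illustration; the actual values Spark computes are different):

```python
num_partitions = 4

# For most (x, y) pairs, hashing the tuple in the other order gives a
# different value, so the same row can land in a different partition.
pairs = [(x, y) for x in range(10) for y in range(10) if x != y]
flipped_differs = sum(
    hash((x, y)) % num_partitions != hash((y, x)) % num_partitions
    for x, y in pairs
)
print(f"{flipped_differs} of {len(pairs)} pairs change partition when flipped")
```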
Does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a SQL query with GROUP BY on the same columns?
It depends on what exactly you are asking:

- GROUP BY with the same set of columns will result in the same logical distribution of keys over partitions.
- GROUP BY "sees" only the actual groups.

Related: How to define partitioning of DataFrame?