Question
According to Learning Spark:
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
One difference I understand is that with repartition() the number of partitions can be increased or decreased, but with coalesce() the number of partitions can only be decreased.
If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?
Answer 1:
It avoids a full shuffle. If it's known that the number is decreasing then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes, onto the nodes that we kept.
So, it would go something like this:
Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12
Then coalesce down to 2 partitions:
Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)
Notice that Node 1 and Node 3 did not require their original data to move.
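To make the example above concrete, here is a toy model in plain Python. It is only a sketch: the lists stand in for partitions sitting on nodes, and toy_coalesce and its merge_into argument are hypothetical names, not Spark's API or its actual placement algorithm.

```python
# Toy model only: plain lists stand in for partitions living on nodes.
# This is NOT Spark's coalesce implementation, just the idea from the example.

def toy_coalesce(nodes, merge_into):
    """Merge the data of dropped nodes into kept nodes.

    nodes:      dict of node name -> list of records
    merge_into: dict of dropped node name -> kept node name (hypothetical mapping)
    Returns (kept_nodes, moved), where moved counts records that changed node.
    """
    # Nodes that are not being dropped keep their data exactly where it is.
    kept = {name: list(records) for name, records in nodes.items()
            if name not in merge_into}
    moved = 0
    for source, target in merge_into.items():
        kept[target].extend(nodes[source])  # only the dropped node's data travels
        moved += len(nodes[source])
    return kept, moved


nodes = {
    "Node 1": [1, 2, 3],
    "Node 2": [4, 5, 6],
    "Node 3": [7, 8, 9],
    "Node 4": [10, 11, 12],
}
result, moved = toy_coalesce(nodes, {"Node 4": "Node 1", "Node 2": "Node 3"})
print(result)  # {'Node 1': [1, 2, 3, 10, 11, 12], 'Node 3': [7, 8, 9, 4, 5, 6]}
print(moved)   # 6
```

Only 6 of the 12 records change node here, whereas a full shuffle could potentially move all of them.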
Answer 2:
Justin's answer is awesome and this response goes into more depth.
The repartition algorithm does a full shuffle and creates new partitions with data that's distributed evenly. Let's create a DataFrame with the numbers from 1 to 12.
val x = (1 to 12).toList
val numbersDf = x.toDF("number")
numbersDf contains 4 partitions on my machine.
numbersDf.rdd.partitions.size // => 4
Here is how the data is divided on the partitions:
Partition 00000: 1, 2, 3
Partition 00001: 4, 5, 6
Partition 00002: 7, 8, 9
Partition 00003: 10, 11, 12
Let's do a full shuffle with the repartition method and get this data into two partitions.
val numbersDfR = numbersDf.repartition(2)
Here is how the numbersDfR data is partitioned on my machine:
Partition A: 1, 3, 4, 6, 7, 9, 10, 12
Partition B: 2, 5, 8, 11
The repartition method makes new partitions and evenly distributes the data in the new partitions (the data distribution is more even for larger data sets).
Difference between coalesce and repartition
coalesce uses existing partitions to minimize the amount of data that's shuffled. repartition creates new partitions and does a full shuffle. coalesce results in partitions with different amounts of data (sometimes partitions that have much different sizes) and repartition results in roughly equal sized partitions.
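The size difference can be illustrated with a small plain-Python sketch. This is a toy model under stated assumptions: merge_partitions and deal_round_robin are hypothetical helpers, and the skewed input is made up. Spark's real coalesce also takes locality into account, and its real repartition does not necessarily deal records in this exact order.

```python
# Toy model: compare partition sizes after merging whole partitions
# versus redistributing every record. Not Spark's actual algorithms.

def merge_partitions(partitions, n):
    """Coalesce-like: glue whole adjacent partitions together into n groups."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out


def deal_round_robin(partitions, n):
    """Repartition-like: deal every record, one by one, into n new partitions."""
    out = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for record in part:
            out[position % n].append(record)
            position += 1
    return out


# Hypothetical skewed input, e.g. what might remain after a selective filter.
skewed = [list(range(0, 9)), [9], [10], [11]]

coalesced_sizes = [len(p) for p in merge_partitions(skewed, 2)]
repartitioned_sizes = [len(p) for p in deal_round_robin(skewed, 2)]
print(coalesced_sizes)      # [10, 2]
print(repartitioned_sizes)  # [6, 6]
```

Merging keeps the existing skew, while dealing every record out again evens the sizes at the cost of touching all the data.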
Is coalesce or repartition faster?
coalesce may run faster than repartition, but unequal sized partitions are generally slower to work with than equal sized partitions. You'll usually need to repartition datasets after filtering a large data set. I've found repartition to be faster overall because Spark is built to work with equal sized partitions.
Read this blog post if you'd like even more details.
Answer 3:
One additional point to note is that immutability is a basic principle of Spark RDDs, so repartition or coalesce will create a new RDD. The base RDD continues to exist with its original number of partitions. If the use case requires persisting the RDD in cache, that has to be done for the newly created RDD.
scala> pairMrkt.repartition(10)
res16: org.apache.spark.rdd.RDD[(String, Array[String])] =MapPartitionsRDD[11] at repartition at <console>:26
scala> res16.partitions.length
res17: Int = 10
scala> pairMrkt.partitions.length
res20: Int = 2
Answer 4:
All the answers are adding some great knowledge into this very often asked question.
So going by tradition of this question's timeline, here are my 2 cents.
I found repartition to be faster than coalesce in one very specific case.
In my application, when the estimated number of output files is lower than a certain threshold, repartition works faster.
Here is what I mean:
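import org.apache.spark.sql.SaveMode

// coalesce when many output files are expected, repartition (full shuffle) when few
if (numFiles > 20)
  df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
else
  df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)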
In the above snippet, when there were fewer than 20 files, coalesce was taking forever to finish while repartition was much faster, hence the code above.
Of course, this number (20) will depend on the number of workers and amount of data.
Hope that helps.
Answer 5:
repartition - It is recommended to use repartition when increasing the number of partitions, because it involves shuffling all of the data.
coalesce - It is recommended to use coalesce when reducing the number of partitions. For example, if you have 3 partitions and want to reduce them to 2, coalesce will move the data of the 3rd partition into partitions 1 and 2, while partitions 1 and 2 remain in the same container. repartition, by contrast, will shuffle the data in all partitions, so network usage between executors will be high, which hurts performance.
Performance-wise, coalesce performs better than repartition when reducing the number of partitions.
Answer 6:
What follows from the code and code docs is that coalesce(n) is the same as coalesce(n, shuffle = false) and repartition(n) is the same as coalesce(n, shuffle = true).
Thus, both coalesce and repartition can be used to increase the number of partitions (coalesce only with shuffle = true):
With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large.
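The relationship between the two calls can be sketched as a toy model in plain Python. toy_coalesce and toy_repartition are hypothetical stand-ins that only mimic how the partition count behaves; they are not Spark's implementation.

```python
# Toy model of how the shuffle flag affects the resulting partition count.

def toy_coalesce(partitions, n, shuffle=False):
    if shuffle:
        # With a shuffle every record can be redistributed,
        # so the partition count can go up as well as down.
        out = [[] for _ in range(n)]
        records = [r for part in partitions for r in part]
        for i, record in enumerate(records):
            out[i % n].append(record)
        return out
    # Without a shuffle whole partitions can only be merged,
    # so asking for more partitions leaves the count unchanged.
    n = min(n, len(partitions))
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out


def toy_repartition(partitions, n):
    # Mirrors the statement above: repartition(n) == coalesce(n, shuffle = true).
    return toy_coalesce(partitions, n, shuffle=True)


parts = [[1, 2, 3], [4, 5, 6]]
grow_without_shuffle = len(toy_coalesce(parts, 4))
grow_with_shuffle = len(toy_coalesce(parts, 4, shuffle=True))
grow_with_repartition = len(toy_repartition(parts, 4))
print(grow_without_shuffle)   # 2
print(grow_with_shuffle)      # 4
print(grow_with_repartition)  # 4
```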
Another important point to stress is that if you drastically decrease the number of partitions, you should consider using the shuffled version of coalesce (the same as repartition in that case). This allows your computations to be performed in parallel on the parent partitions (multiple tasks).
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
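The effect on parallelism can be written down as a tiny plain-Python sketch. upstream_tasks is a hypothetical helper that only restates the rule quoted above; it does not query Spark, and the partition count of 200 is an arbitrary example.

```python
# Toy model: how many tasks can run the work *before* the coalesce in parallel.

def upstream_tasks(parent_partitions, target_partitions, shuffle):
    if shuffle:
        # The shuffle is a stage boundary, so the upstream work
        # keeps its own partitioning and parallelism.
        return parent_partitions
    # Without a shuffle the coalesce is part of the same stage,
    # so the upstream work runs with the reduced partition count.
    return min(parent_partitions, target_partitions)


without_shuffle = upstream_tasks(200, 1, shuffle=False)
with_shuffle = upstream_tasks(200, 1, shuffle=True)
print(without_shuffle)  # 1
print(with_shuffle)     # 200
```

This is one plausible reason a coalesce down to very few partitions can take much longer than a repartition to the same number.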
Please also refer to the related answer here
Answer 7:
To all the great answers I would like to add that repartition is one of the best options for taking advantage of data parallelism, while coalesce gives a cheap way to reduce the number of partitions and is very useful when writing data to HDFS or some other sink to take advantage of big writes. I have found this useful when writing data in Parquet format.
Answer 8:
I would like to add to Justin's and Power's answers that:
"repartition" will ignore existing partitions and create new ones, so you can use it to fix data skew. You can specify partition keys to define the distribution. Data skew is one of the biggest problems in the 'big data' problem space.
"coalesce" will work with existing partitions and shuffle a subset of them. It can't fix data skew as much as "repartition" can, so even though it is less expensive it may not be what you need.
Answer 9:
For anyone who has had issues generating a single CSV file from PySpark (AWS EMR) and saving it on S3, using repartition helped. The reason is that coalesce does not do a full shuffle, but repartition does. Essentially, you can increase or decrease the number of partitions using repartition, whereas coalesce (without a shuffle) can only decrease it. coalesce(1) is valid, but because it avoids a shuffle it can also collapse the upstream computation into a single task. Here is the code for anyone who is trying to write a CSV from AWS EMR to S3:
df.repartition(1).write.format('csv')\
.option("path", "s3a://my.bucket.name/location")\
.save(header = 'true')
Answer 10:
In simple terms, COALESCE is only for decreasing the number of partitions. It does not shuffle the data; it just merges partitions.
REPARTITION is for both increasing and decreasing the number of partitions, but a shuffle takes place.
Example:
val rdd = sc.textFile("path",7)
rdd.repartition(10)
rdd.repartition(2)
Both work fine.
We generally reach for these two methods when we need the output gathered in one place.
Answer 11:
You should also make sure that the nodes receiving the coalesced data are well provisioned if you are dealing with huge data. Because all the data will be loaded onto those nodes, this may lead to out-of-memory errors. Although repartition is costly, I prefer to use it, since it shuffles and distributes the data equally.
Choose wisely between coalesce and repartition.
Answer 12:
Repartition: shuffles the data into a NEW number of partitions.
E.g. our initial data frame is split into 200 partitions.
df.repartition(500): data will be shuffled from the 200 partitions into 500 new partitions.
Coalesce: merges the data into a smaller number of existing partitions.
df.coalesce(5): data will be moved from the remaining 195 partitions into 5 existing partitions.
Source: https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce