Question
I am joining two datasets on two columns, and the result is a dataset containing 55 billion rows. After that I have to do some aggregation on this dataset by a different column than the ones used in the join. The problem is that Spark performs an exchange (repartition) after the join, which takes too much time with 55 billion rows, even though the data is already correctly distributed because the aggregation column is unique. Since I know the aggregation key is correctly distributed, is there a way to tell this to the Spark application?
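For reference, a minimal sketch of the pattern described in the question (paths, column names, and the aggregate function are hypothetical, not taken from the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("join-then-aggregate").getOrCreate()

// Hypothetical inputs joined on two columns.
val left = spark.read.parquet("/data/left")
val right = spark.read.parquet("/data/right")
val joined = left.join(right, Seq("key1", "key2"))

// Aggregating by a column that was not a join key makes Spark plan another
// Exchange (hashpartitioning on "id") before the aggregation.
val aggregated = joined.groupBy("id").agg(sum("amount").as("total"))
aggregated.explain()  // the extra Exchange is visible in the physical plan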
Answer 1:
1) Go to the Spark UI and check the "Locality Level"
2) If joining a large dataset with a small one, use a broadcast join:
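A sketch of the broadcast-join hint in the DataFrame API (dataset names and paths are placeholders):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().getOrCreate()

// Hypothetical inputs; the broadcast side must fit in executor memory.
val largeDF = spark.read.parquet("/data/large")
val smallDF = spark.read.parquet("/data/small")

// The hint ships smallDF to every executor, so largeDF is joined without being shuffled.
val joined = largeDF.join(broadcast(smallDF), Seq("key1", "key2"))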
3) If joining a large dataset with a medium-sized one, and the medium-sized RDD does not fit fully into memory for a broadcast, pre-filter the large RDD by the medium RDD's keys:
// Broadcast only the join keys of the medium RDD (a Set fits in memory even if the full RDD does not).
val keys = sc.broadcast(mediumRDD.map(_._1).collect.toSet)
// Drop rows from the large RDD that cannot match, shrinking the data that gets shuffled for the join.
val reducedRDD = largeRDD.filter { case (key, value) => keys.value.contains(key) }
reducedRDD.join(mediumRDD)
4) Check whether the data is serialized efficiently, e.g. with Kryo:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max", "128m")
.set("spark.kryoserializer.buffer", "64m")
.registerKryoClasses(
  Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]])
)
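Put together, a self-contained sketch of that configuration might look like this (the app name is a placeholder):
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Build a SparkConf with Kryo serialization; the app name is made up for illustration.
val conf = new SparkConf()
  .setAppName("join-aggregate-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "128m")
  .set("spark.kryoserializer.buffer", "64m")
  .registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))

// Reuse the conf when creating the SparkSession.
val spark = SparkSession.builder().config(conf).getOrCreate()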
5) Check the number of partitions in the Spark UI, or add the following line in code for debugging:
df.rdd.getNumPartitions
In the Spark application UI, the "Total Tasks" count for a stage corresponds to the number of partitions.
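A quick way to confirm this in code as well (df stands for whatever DataFrame is being inspected; this is an assumption, not part of the original answer):
println(df.rdd.getNumPartitions)  // partition count, matching "Total Tasks" in the UI
df.explain()                      // the physical plan shows any Exchange (shuffle) Spark still inserts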
Source: https://stackoverflow.com/questions/46951207/spark-doing-exchange-of-partitions-already-correctly-distributed