Efficient way of joining multiple tables in Spark - No space left on device

栀梦 2021-01-06 20:50

A similar question has been asked here, but it does not address my question properly. I have nearly 100 DataFrames, each with at least 200,000 rows a

2 Answers
  •  暖寄归人
    2021-01-06 21:29

    I've had a similar problem in the past, except I didn't have that many RDDs. The most efficient solution I could find was to use the low-level RDD API. First, store all the RDDs so that they are hash-partitioned with the same partitioner and sorted within partitions by the join column(s), as sketched below: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/rdd/OrderedRDDFunctions.html#repartitionAndSortWithinPartitions-org.apache.spark.Partitioner-
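
    A minimal sketch of this first step in Scala, assuming Spark 2.4; the DataFrame names (df1, df2), the join column "id", and numPartitions are placeholders standing in for the asker's ~100 DataFrames, not anything from the original question:

        import org.apache.spark.HashPartitioner
        import org.apache.spark.rdd.RDD
        import org.apache.spark.sql.{DataFrame, Row, SparkSession}
        import org.apache.spark.storage.StorageLevel

        val spark = SparkSession.builder.getOrCreate()
        import spark.implicits._

        // df1, df2 stand in for two of the many DataFrames from the question.
        val df1 = Seq(("a", 1), ("b", 2)).toDF("id", "v1")
        val df2 = Seq(("a", 10), ("c", 30)).toDF("id", "v2")

        val numPartitions = 200                        // tune to the cluster
        val partitioner = new HashPartitioner(numPartitions)

        // Key each DataFrame by the join column, then hash-partition and sort
        // within partitions, so every RDD ends up co-partitioned and co-sorted.
        def prepare(df: DataFrame): RDD[(String, Row)] =
          df.rdd
            .map(row => (row.getAs[String]("id"), row))
            .repartitionAndSortWithinPartitions(partitioner)
            .persist(StorageLevel.MEMORY_AND_DISK)     // spill to disk rather than fail

        val prepared1 = prepare(df1)
        val prepared2 = prepare(df2)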

    After this, the join can be implemented with zipPartitions, without shuffling or using much memory: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/rdd/RDD.html#zipPartitions-org.apache.spark.rdd.RDD-boolean-scala.Function2-scala.reflect.ClassTag-scala.reflect.ClassTag-
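
    A sketch of the zipPartitions step, continuing from the prepared RDDs above. Because both inputs share the partitioner and are sorted within partitions, partition i of one RDD only ever needs partition i of the other, so no shuffle occurs and only two partition iterators are held at a time. The merge logic here is hand-written, assumes unique keys per side, and implements an inner join; adjust it for duplicates or outer joins as needed:

        val joined: RDD[(String, (Row, Row))] =
          prepared1.zipPartitions(prepared2, preservesPartitioning = true) {
            (leftIter, rightIter) =>
              val left  = leftIter.buffered
              val right = rightIter.buffered
              new Iterator[(String, (Row, Row))] {
                // Skip keys present on only one side (merge-join style).
                private def align(): Unit =
                  while (left.hasNext && right.hasNext &&
                         left.head._1.compareTo(right.head._1) != 0) {
                    if (left.head._1.compareTo(right.head._1) < 0) left.next()
                    else right.next()
                  }
                def hasNext: Boolean = { align(); left.hasNext && right.hasNext }
                def next(): (String, (Row, Row)) = {
                  val (k, l) = left.next()
                  val (_, r) = right.next()
                  (k, (l, r))
                }
              }
          }

        joined.take(5).foreach(println)

    For many DataFrames you would repeat this, folding each prepared RDD into the running result; since the output of zipPartitions keeps the same partitioning, the subsequent joins stay shuffle-free as well.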
