Does a flatMap in spark cause a shuffle?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2020-01-01 05:05:26

Question


Does flatMap in Spark behave like the map function and therefore cause no shuffling, or does it trigger a shuffle? I suspect it does cause shuffling. Can someone confirm this?


Answer 1:


There is no shuffling with either map or flatMap. The operations that cause a shuffle are:

  • Repartition operations:
    • repartition
    • coalesce
  • ByKey operations (except for counting):
    • groupByKey
    • reduceByKey
  • Join operations:
    • cogroup
    • join
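
A minimal sketch of the difference (assuming a local SparkContext named sc; the variable names are illustrative): map and flatMap stay within their stage, while reduceByKey introduces a stage boundary that shows up in the lineage printed by toDebugString.

val lines  = sc.parallelize(Seq("a b", "b c"), 2)
val tokens = lines.flatMap(_.split(" "))            // narrow: no shuffle
val counts = tokens.map((_, 1)).reduceByKey(_ + _)  // wide: triggers a shuffle

// The lineage shows a ShuffledRDD only for the reduceByKey step.
println(counts.toDebugString)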

Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:

  • mapPartitions to sort each partition using, for example, .sorted
  • repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
  • sortBy to make a globally ordered RDD
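
A hedged sketch of those three options, assuming an existing pair RDD named pairs of type RDD[(Int, String)]:

import org.apache.spark.HashPartitioner

// Sort each existing partition in place (materializes the partition in memory).
val perPartition = pairs.mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator)

// Repartition and sort by key within the new partitions in a single shuffle.
val repartitionedSorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

// Produce a globally ordered RDD.
val globallySorted = pairs.sortBy(_._1)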

More info here: http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations




Answer 2:


No shuffling. Here are the sources for both functions:

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

/**
 *  Return a new RDD by first applying a function to all elements of this
 *  RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

As you can see, RDD.flatMap simply calls flatMap on the Scala Iterator that represents a partition, so the transformation stays within each partition.
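
The same behavior can be seen with a plain Scala Iterator, outside Spark entirely (a toy illustration, not Spark code): flatMap consumes the iterator lazily in a single pass, which is why no data needs to move between partitions.

val partition: Iterator[String] = Iterator("a b", "c d")
val flattened: Iterator[String] = partition.flatMap(_.split(" "))
println(flattened.toList)  // List(a, b, c, d)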




Answer 3:


flatMap may cause a shuffle write in some cases: for example, if you generate multiple elements into the same partition and those elements cannot fit into that partition, Spark writes them into a different partition.

For example:

// `rdd` is assumed to be an existing RDD of some large type BigObject.
val rdd: RDD[BigObject] = ...

rdd.flatMap { bigObject =>
  val rangeList: List[Int] = List.range(1, 1000)
  rangeList.map(num => (num, bigObject))
}

The code above runs within a single partition, but because it creates so many instances of BigObject, Spark ends up writing those objects into separate partitions, which causes a shuffle write.



Source: https://stackoverflow.com/questions/36414123/does-a-flatmap-in-spark-cause-a-shuffle
