Why is Apache Spark's take function not parallel?

春和景丽 2020-12-19 17:29

Reading the Apache Spark guide at http://spark.apache.org/docs/latest/programming-guide.html, it states:

\"enter

2 Answers
  •  自闭症患者
    2020-12-19 18:24

    Actually, while take is not entirely parallel, it's not entirely sequential either.

    For example, let's say you call take(200), and each partition has 10 elements. take first fetches partition 0 and sees that it has 10 elements. It estimates that it would need 20 such partitions to get 200 elements, but it's better to ask for a bit more in a single parallel request. So it aims for 30 partitions, and it already has 1. It therefore fetches partitions 1 to 29 next, in parallel. This is likely the last step. If it's very unlucky and still does not have 200 elements in total, it again makes an estimate and requests another batch in parallel.
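    The estimate described above can be sketched as follows. This is a simplified Python illustration of the interpolation done by RDD.take, not Spark's actual code; the function name and signature are mine:

    ```python
    def estimate_partitions(num, parts_scanned, buf_size):
        """Estimate the total number of partitions needed to collect `num`
        elements, extrapolating from the element density seen in the
        partitions scanned so far, then padding by 50% so a single extra
        parallel batch usually suffices. (Illustrative sketch only.)"""
        exact = num * parts_scanned / buf_size  # e.g. 200 * 1 / 10 = 20
        return int(1.5 * exact)                 # pad by 50% -> 30

    # The example above: take(200), partition 0 yielded 10 elements.
    wanted = estimate_partitions(200, 1, 10)
    print(wanted)       # 30 partitions total
    print(wanted - 1)   # already has partition 0, so fetch 29 more in parallel
    ```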

    Check out the code; it's well documented: https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049

    I think the documentation is wrong. Local calculation only happens when a single partition is required. This is the case in the first pass (fetching partition 0), but typically not the case in later passes.
