Why is Apache Spark's take function not parallel?

春和景丽
春和景丽 2020-12-19 17:29

Reading the Apache Spark guide at http://spark.apache.org/docs/latest/programming-guide.html, it states:


    take(n): Return an array with the first n elements of the dataset. Note that this is currently not executed in parallel. Instead, the driver program computes all the elements.

2 Answers
  •  再見小時候
    2020-12-19 18:18

    How would you implement it in parallel? Let's say you have 4 partitions and want to take the first 5 elements. If you knew in advance the size of each partition, it would be easy: for example, if each partition has 3 elements, the driver asks partition 0 for all of its elements and asks partition 1 for 2 elements. The problem is that it isn't known in advance how many elements each partition has.
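
    The sequential strategy described above can be sketched with plain Python lists standing in for partitions (a toy model, not Spark's actual code; the function name `take` here is just illustrative). The driver pulls from one partition at a time and stops as soon as it has enough elements:

    ```python
    def take(partitions, n):
        """Driver-side take: scan partitions in order, stopping early.

        partitions: list of lists, each standing in for one RDD partition.
        """
        result = []
        for part in partitions:
            if len(result) >= n:
                break  # already have n elements; remaining partitions are never computed
            result.extend(part[:n - len(result)])
        return result

    # take([[1, 2, 3], [4, 5, 6], [7], [8, 9]], 5) -> [1, 2, 3, 4, 5]
    ```

    As I understand it, Spark's real `RDD.take` refines this by running successive jobs: it first scans one partition, and if that yields too few elements it scans a growing batch of further partitions in the next round, so the common case touches only a small prefix of the data.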

    Now, you could first calculate the partition sizes, but this would require limiting the set of supported RDD transformations, computing elements more than once, or some other tradeoff, and would generally add communication overhead.
