Why is Apache Spark's take function not parallel?

春和景丽 2020-12-19 17:29

Reading the Apache Spark guide at http://spark.apache.org/docs/latest/programming-guide.html, it states:

\"enter

2 Answers
  •  自闭症患者
    2020-12-19 18:24

    Actually, while take is not entirely parallel, it's not entirely sequential either.

    For example, let's say you call take(200), and each partition has 10 elements. take first fetches partition 0 and sees that it has 10 elements. It estimates that it would need 20 such partitions to get 200 elements, but it's better to ask for a bit more in a single parallel request. So it aims for 30 partitions, and it already has 1. It therefore fetches partitions 1 to 29 next, in parallel. This is likely the last step. If it's very unlucky and still does not have 200 elements in total, it again makes an estimate and requests another batch in parallel.
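    The estimate described above can be sketched as follows. This is a simplified Python illustration of the interpolation done by RDD.take, not Spark's actual code; the function name and signature are mine:

    ```python
    def estimate_partitions(num, parts_scanned, buf_size):
        """Estimate the total number of partitions needed to collect `num`
        elements, extrapolating from the element density seen in the
        partitions scanned so far, then padding by 50% so a single extra
        parallel batch usually suffices. (Illustrative sketch only.)"""
        exact = num * parts_scanned / buf_size  # e.g. 200 * 1 / 10 = 20
        return int(1.5 * exact)                 # pad by 50% -> 30

    # The example above: take(200), partition 0 yielded 10 elements.
    wanted = estimate_partitions(200, 1, 10)
    print(wanted)       # 30 partitions total
    print(wanted - 1)   # already has partition 0, so fetch 29 more in parallel
    ```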

    Check out the code; it's well documented: https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049

    I think the documentation is wrong. Local calculation only happens when a single partition is required. This is the case in the first pass (fetching partition 0), but typically not the case in later passes.
