Differentiate driver code and worker code in Apache Spark

迷失自我 2021-01-05 05:11

In an Apache Spark program, how do we know which part of the code will execute in the driver program and which part will execute on the worker nodes?

With Regards

2 Answers
  •  Happy的楠姐
    2021-01-05 05:42

    It is actually pretty simple. Everything that happens inside the closure created by a transformation happens on a worker. This means that anything passed inside map(...), filter(...), mapPartitions(...), groupBy*(...), or aggregateBy*(...) is executed on the workers. That includes reading data from persistent storage or remote sources.
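    To make this concrete, here is a minimal, simulated sketch of that split. The assumption is that there is no real Spark cluster: the "worker side" is modeled as a plain Python function applied to each partition, while the driver merely plans the partitioning. The names (`worker_side`, `partitions`) are illustrative, not Spark API.

    ```python
    # Simulated sketch: no cluster; worker execution is modeled as
    # plain function calls applied to each partition.
    data = list(range(1, 11))
    partitions = [data[:5], data[5:]]       # driver side: only plans the split

    def worker_side(partition):
        # Everything inside map(...)/filter(...) closures would run here,
        # on an executor, one partition at a time.
        return [x * x for x in partition if x % 2 == 0]

    results = [worker_side(p) for p in partitions]
    # → [[4, 16], [36, 64, 100]]
    ```

    In real Spark the closure passed to map/filter is serialized and shipped to the executors, which is why it must not reference driver-only objects.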

    Actions like count, reduce(...), and fold(...) are usually executed on both the driver and the workers. The heavy lifting is performed in parallel by the workers, and some final steps, such as merging the partial outputs received from the workers, are performed sequentially on the driver.
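    That two-phase pattern can be sketched as follows, again without a cluster (an assumption): each "worker" reduces its own partition in parallel, and the driver only combines the small per-partition results.

    ```python
    from functools import reduce

    partitions = [[1, 2, 3], [4, 5], [6]]       # data already distributed

    # Worker side: each executor reduces only its own partition.
    partials = [sum(p) for p in partitions]     # → [6, 9, 6]

    # Driver side: a cheap, sequential merge of the partial results.
    total = reduce(lambda a, b: a + b, partials)
    # total == 21
    ```

    This is why the merge function given to reduce(...) must be associative: the per-partition order in which the driver receives partial results is not guaranteed.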

    Everything else, such as triggering an action or defining a transformation, happens on the driver. In particular, this includes any operation that requires access to the SparkContext. In PySpark it also involves communication with the Py4J gateway.
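    One consequence: a closure that captures the SparkContext cannot be shipped to the workers, because the context is not serializable. The stdlib sketch below illustrates the same failure mode with plain pickle; `FakeContext` is a hypothetical stand-in holding a thread lock, which, like a SparkContext, cannot be pickled.

    ```python
    import pickle
    import threading

    class FakeContext:
        """Hypothetical stand-in for SparkContext: holds unpicklable state."""
        def __init__(self):
            self._lock = threading.Lock()   # locks cannot be pickled

    serializable = True
    try:
        pickle.dumps(FakeContext())         # what Spark would try to ship
    except TypeError:
        serializable = False                # → False: cannot leave the driver
    ```

    Real Spark raises a similar serialization error if a map/filter closure references the SparkContext, which is why such code must stay on the driver.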
