Spark pyspark vs spark-submit


Question


The documentation on spark-submit says the following:

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster.

Regarding pyspark, it says the following:

You can also use bin/pyspark to launch an interactive Python shell.

This question may sound stupid, but when I am running the commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?


Answer 1:


There is no practical difference between these two. Unless configured otherwise, both will execute code in local mode. If a master is configured (either with the --master command-line parameter or the spark.master configuration property), the corresponding cluster will be used to execute the program.
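
For example, a minimal sketch of setting the master programmatically (the app name and the standalone master URL here are illustrative, not from the original answer); the same choice can be made on the command line with spark-submit --master:

  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("master-demo")
      .master("local[*]")  # or e.g. "yarn" / "spark://host:7077" for a cluster
      .getOrCreate()
  )
  print(spark.sparkContext.master)  # confirms which master is actually in effect
  spark.stop()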




Answer 2:


If you are using EMR, there are three ways to run a job:

  1. using pyspark (or spark-shell)
  2. using spark-submit without --master and --deploy-mode
  3. using spark-submit with --master and --deploy-mode

Although all three of the above run the application on the Spark cluster, there is a difference in how the driver program works (see the sketch after this list).

  • In the 1st and 2nd cases the driver runs in client mode, whereas in the 3rd the driver also runs inside the cluster.
  • In the 1st and 2nd cases you have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel.
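
To make the three cases concrete, here is a hedged sketch of the invocations (my_job.py is a placeholder script name; on EMR the defaults for case 2 typically come from the cluster's spark-defaults configuration):

  # 1. interactive shell -- driver runs in client mode on the node you launch from
  pyspark

  # 2. no --master/--deploy-mode -- falls back to the cluster defaults
  #    (on EMR this is typically YARN with client deploy mode)
  spark-submit my_job.py

  # 3. explicit cluster mode -- the driver itself runs inside the cluster
  spark-submit --master yarn --deploy-mode cluster my_job.py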



Answer 3:


Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):

...when I am running the commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?

As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that will be registered/executed on the cluster.

As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, the driver still executes standard Python code, but it runs on a different machine from the one you call spark-submit from.
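
A minimal sketch of that split (the values and app name are illustrative): plain Python executes on the driver only, while the lambda passed to map is shipped to and executed by executors across the cluster:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("driver-vs-cluster").getOrCreate()

  local_total = sum(range(1000))  # plain Python: runs on the driver only

  rdd = spark.sparkContext.parallelize(range(1000))
  # the lambda is serialized and executed by executors on the cluster
  distributed_total = rdd.map(lambda x: x * 2).sum()

  print(local_total, distributed_total)
  spark.stop()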




Answer 4:


  1. PySpark differs in some important ways from Scala and Java Spark; in particular, cluster deploy mode for Python applications is only supported on YARN.
  2. If you are running PySpark on a local machine, you can use the pyspark shell. On a cluster, use spark-submit.
  3. If your PySpark job has any Python dependencies, you need to package them as a zip file for submission (see the sketch below).
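
A hedged sketch of that packaging step (deps.zip, mypackage/ and my_job.py are placeholder names): zip your modules at the package root and pass the archive with --py-files so the executors can import them:

  # package local Python modules into an archive (placeholder paths)
  zip -r deps.zip mypackage/

  # ship the archive alongside the job; executors add it to their Python path
  spark-submit --master yarn --deploy-mode cluster --py-files deps.zip my_job.py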



Source: https://stackoverflow.com/questions/36910014/spark-pyspark-vs-spark-submit
