Apache Spark: Differences between client and cluster deploy modes


TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to r


3 Answers

臣服心动

2020-11-30 18:28
Suppose you run spark-submit on EMR after connecting to the master node over SSH. If you pass --deploy-mode cluster, the following happens:
1. You won't see the detailed logs in the terminal.
2. Because the driver is not created on the master node itself, you won't be able to terminate the job from the terminal.
With --deploy-mode client, by contrast:
1. You will see the detailed logs in the terminal.
2. You will be able to terminate the job from the terminal itself.
These are the basic differences I have noticed so far.
栀梦

2020-11-30 18:31
What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

Let's try to look at the differences between client and cluster mode.

Client:
- The driver runs inside a dedicated process on the machine where spark-submit is invoked, here a dedicated server (the Master node). This means it has all of that machine's available resources at its disposal to execute work.
- The driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (a big advantage).
- Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources on the driver program.
- If the driver process dies, you need an external monitoring system to restart it.
Cluster:
- The driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
- The driver runs as a dedicated, standalone process inside the Worker.
- The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
- The driver program can be monitored from the Master node using the --supervise flag and be restarted in case it dies.
- When working in cluster mode, all JARs related to the execution of your application need to be accessible to all the workers. This means you can either place them manually in a shared location or in a folder on each of the workers.
Which one is better? Not sure, that's for you to experiment and decide. There is no single better choice here: each mode gains you something, and it's up to you to see which one works better for your use case.
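
As a rough illustration of the cluster-mode points above (supervision and JAR visibility), here is a small dry-run sketch in shell. It only prints a command line and does not contact a cluster. The helper names `jar_visibility` and `cluster_submit_cmd`, the class `com.example.MyApp`, and the hosts `master-host` and `namenode` are hypothetical placeholders, and the list of "shared" URL schemes is an assumption kept deliberately short.

```shell
#!/bin/sh
# Dry-run sketch for standalone cluster mode. Nothing here talks to Spark;
# all names and hosts are hypothetical placeholders.

# Classify where a JAR would have to live so that every worker can read it.
jar_visibility() {
  case "$1" in
    hdfs://*|http://*|https://*) echo "shared" ;;
    *) echo "per-node" ;;  # plain or file:// path: must be present on every worker
  esac
}

# Print a cluster-mode command with --supervise, warning about per-node JARs.
cluster_submit_cmd() {
  jar="$1"
  if [ "$(jar_visibility "$jar")" = "per-node" ]; then
    echo "note: $jar must exist at the same path on every worker" >&2
  fi
  echo "./bin/spark-submit --class com.example.MyApp --master spark://master-host:7077 --deploy-mode cluster --supervise $jar"
}

cluster_submit_cmd hdfs://namenode:8020/jars/my-app.jar
```

Passing a plain path such as /opt/jars/my-app.jar prints the same command plus a note on stderr, which mirrors the "folder on each of the workers" option.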

How do I choose which one my application is going to run on, using spark-submit?

You choose the mode with the --deploy-mode flag. From the Spark documentation on submitting applications:
```
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```
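
To make the template more concrete, the following sketch fills it in for both modes. It is a dry run that prints the command rather than executing it, so it can be tried without a cluster. `build_submit_cmd`, `com.example.MyApp`, `master-host` and the JAR path are hypothetical placeholders. When --deploy-mode is omitted, spark-submit uses client mode.

```shell
#!/bin/sh
# Dry-run sketch: prints a spark-submit command line instead of executing it.
# build_submit_cmd, com.example.MyApp, master-host and the JAR path are
# hypothetical placeholders, not values taken from the answer.

build_submit_cmd() {
  mode="$1"
  jar="$2"
  # spark-submit accepts only these two deploy modes.
  case "$mode" in
    client|cluster) ;;
    *)
      echo "deploy mode must be 'client' or 'cluster', got '$mode'" >&2
      return 1
      ;;
  esac
  echo "./bin/spark-submit --class com.example.MyApp --master spark://master-host:7077 --deploy-mode $mode $jar"
}

# Print the command for each mode.
build_submit_cmd client /path/to/my-app.jar
build_submit_cmd cluster /path/to/my-app.jar
```

Validating the mode up front is a design choice of this sketch; it catches a typo before any command is printed.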
小鲜肉

2020-11-30 18:35

I have the same scenario: the master node is part of a standalone EC2 cluster. In this setup, client mode is appropriate. The driver is launched directly within the spark-submit process, which acts as a client to the cluster, and the input and output of the application are attached to the console. This mode is therefore especially suitable for applications that involve a REPL.

If, on the other hand, your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize the network latency between the driver and the executors.
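
The rule of thumb in this answer can be written down as a tiny shell helper. This is only a sketch of the reasoning, not anything Spark provides: `choose_deploy_mode` and its yes/no arguments are hypothetical.

```shell
#!/bin/sh
# choose_deploy_mode INTERACTIVE NEAR_WORKERS   (each argument is "yes" or "no")
# Hypothetical helper that encodes the rule of thumb from the answer.
choose_deploy_mode() {
  interactive="$1"
  near_workers="$2"
  if [ "$interactive" = "yes" ]; then
    echo client    # console input/output must stay attached to the driver
  elif [ "$near_workers" = "no" ]; then
    echo cluster   # run the driver close to the executors to cut latency
  else
    echo client
  fi
}

# A batch job submitted from a machine far from the workers:
choose_deploy_mode no no
```

Interactive use is checked first because a REPL needs the driver in the local process regardless of where the submitting machine sits.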
