How to debug Spark application on Spark Standalone?

北恋 · 2020-12-13 20:02

I am trying to debug a Spark Application on a cluster using a master and several worker nodes. I have been successful at setting up the master node and worker nodes using Sp…

4 Answers
  •  没有蜡笔的小新
    2020-12-13 20:53

    I followed the same steps to set up a Spark standalone cluster and was able to debug the driver, master, worker, and executor JVMs.

    The master and the worker node are configured on a server-class machine with 12 CPU cores. The source code for Spark 2.2.0 has been cloned from the Spark Git repo.

    STEPS:

    1] Command to launch the Master JVM:

    root@ubuntu:~/spark-2.2.0-bin-hadoop2.7/bin# ./spark-class -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8787 org.apache.spark.deploy.master.Master
    

    The spark-class shell script is used to launch the master manually. The first argument is a JVM option that launches the master in debug mode: the JVM is suspended (suspend=y) and waits on port 8787 for the IDE to make a remote connection.

    (Screenshots in the original post show the IDE configuration for remote debugging.)

    2] Command to launch the Worker JVM:

    root@ubuntu:~/spark-2.2.0-bin-hadoop2.7/bin# ./spark-class -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8788 org.apache.spark.deploy.worker.Worker spark://10.71.220.34:7077
    

    As with the master, the worker JVM is launched in debug mode and suspended; its debug port is 8788. The last argument is the address of the Spark master.

    As part of the launch the worker registers with the master.


    3] A basic Java app with a main method is compiled and wrapped into an uber/fat jar, as explained in the book “Learning Spark”. An uber jar contains the application plus all of its transitive dependencies.

    It is created by running mvn package in the project directory:

    root@ubuntu:/home/customer/Documents/Texts/Spark/learning-spark-master# mvn package
    

    The above generates a jar under the ./target folder.

    (The screenshot in the original post shows the Java application that will be submitted to the Spark cluster.)

    4] Command to submit the application to the standalone cluster:

    root@ubuntu:/home/customer/Documents/Texts/Spark/learning-spark-master# /home/customer/spark-2.2.0-bin-hadoop2.7/bin/spark-submit \
        --master spark://10.71.220.34:7077 \
        --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8790" \
        --conf "spark.executor.extraClassPath=/home/customer/Documents/Texts/Spark/learning-spark-master/target/java-0.0.2.jar" \
        --class com.oreilly.learningsparkexamples.java.BasicMapToDouble \
        --name "MapToDouble" \
        ./target/java-0.0.2.jar \
        spark://10.71.220.34:7077

    The trailing spark://10.71.220.34:7077 is not a spark-submit option; it is passed as an argument to the Java program, com.oreilly.learningsparkexamples.java.BasicMapToDouble.

    • The above command is run from the client node, which hosts the application with the main method; the transformations, however, are executed on the remote executor JVM.

    • The --conf parameters are important: they configure the executor JVMs, which are launched at runtime by the worker JVMs.

      · The first --conf specifies that the executor JVM should be launched in debug mode and suspended right away, listening on port 8790.

      · The second --conf specifies that the executor classpath should contain the application jars that are submitted. On a distributed setup these jars need to be copied to the executor JVM's machine.

      · The last argument is the master URL the client app uses to connect to the Spark master (see the sketch after this list).
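
    To make this concrete, the same two --conf values can also be set programmatically from the driver via SparkConf before the context is created. This is a minimal sketch, not from the original post; the class name ExecutorDebugConfSketch is made up for illustration, while spark.executor.extraJavaOptions and spark.executor.extraClassPath are standard Spark properties:

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaSparkContext;

        // Sketch: the two --conf values from the spark-submit command above,
        // set programmatically on the driver side instead.
        public class ExecutorDebugConfSketch {
          public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("MapToDouble")
                .setMaster("spark://10.71.220.34:7077")
                .set("spark.executor.extraJavaOptions",
                     "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8790")
                .set("spark.executor.extraClassPath",
                     "/home/customer/Documents/Texts/Spark/learning-spark-master/target/java-0.0.2.jar");

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... build RDDs and run transformations/actions as in the submitted app ...
            sc.stop();
          }
        }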

    To understand how the client application connects to the Spark cluster, we need to debug the client app and step through it. For that, it too has to run in debug mode.

    To debug the client, edit the spark-submit script as follows:

    Contents of the modified spark-submit:

    exec "${SPARK_HOME}"/bin/spark-class -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8789 org.apache.spark.deploy.SparkSubmit "$@"
    

    5] After the client registers, the worker starts an executor at runtime on a separate thread.

    (The screenshot in the original post shows the class ExecutorRunner.scala.)
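
    Conceptually (a hypothetical, heavily simplified sketch in Java; Spark's real ExecutorRunner is Scala and does much more), the worker assembles the executor's command line, where spark.executor.extraJavaOptions, i.e. our JDWP agent string, ends up, and forks the executor JVM from its own thread:

        import java.util.Arrays;
        import java.util.List;

        // Hypothetical sketch of the worker forking an executor JVM from a
        // separate thread; the real logic lives in ExecutorRunner.scala.
        public class ExecutorRunnerSketch {
          public static void main(String[] args) throws Exception {
            List<String> command = Arrays.asList(
                "java",
                "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8790",
                "-cp", "/home/customer/Documents/Texts/Spark/learning-spark-master/target/java-0.0.2.jar",
                "org.apache.spark.executor.CoarseGrainedExecutorBackend" /* plus executor arguments */);

            Thread runner = new Thread(() -> {
              try {
                Process executor = new ProcessBuilder(command).inheritIO().start();
                executor.waitFor();   // the worker-side thread waits for the executor JVM
              } catch (Exception e) {
                e.printStackTrace();
              }
            });
            runner.start();
            runner.join();
          }
        }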

    6] We now connect to the forked executor JVM from the IDE. The executor JVM runs the transformation functions in our submitted application.

     // Transformation function/lambda: shipped to and executed on the executor JVM.
     JavaDoubleRDD result = rdd.mapToDouble(
          new DoubleFunction<Integer>() {
            public double call(Integer x) {
              double y = (double) x;
              return y * y;
            }
          });
    

    7] The transformation function runs only when the action collect() is invoked.
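
    To see the lazy evaluation end to end, here is a minimal driver sketch (a paraphrase, not the actual Learning Spark source; the class name LazyCollectSketch is made up) that takes the master URL as its argument, the same way BasicMapToDouble is invoked above:

        import java.util.Arrays;
        import java.util.List;

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaDoubleRDD;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;

        // Sketch: mapToDouble only records the transformation; the executor
        // runs it once collect() is called.
        public class LazyCollectSketch {
          public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("MapToDouble").setMaster(args[0]);
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
              JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4));
              JavaDoubleRDD result = rdd.mapToDouble(x -> (double) (x * x));  // lazy: nothing runs yet
              List<Double> squares = result.collect();  // action: tasks are now shipped to the executor
              System.out.println(squares);              // [1.0, 4.0, 9.0, 16.0]
            }
          }
        }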

    8] (The screenshot in the original post shows the Executor view while mapToDouble is invoked in parallel on the elements of the list.) The Executor JVM executes the function in 12 threads because the machine has 12 cores; since the number of cores was not set on the command line, the worker JVM defaulted to -cores=12.
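
    If 12 parallel task threads make single-stepping awkward, the application can cap how many cores it takes from the standalone cluster. A minimal sketch, assuming the standard spark.cores.max property (spark-submit's --total-executor-cores is the command-line equivalent); this is not part of the original post:

        import org.apache.spark.SparkConf;

        public class CoresCapSketch {
          // Sketch: limit the total cores (and hence concurrent task threads)
          // this application may take from the standalone cluster while debugging.
          public static SparkConf withCappedCores(SparkConf conf) {
            return conf.set("spark.cores.max", "2");
          }
        }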

    9] (The screenshot in the original post shows the client-submitted code, mapToDouble(), running in the remote forked Executor JVM.)

    10] After all the tasks have been executed, the Executor JVM exits. Once the client app exits, the worker node is unblocked and waits for the next submission.

    References

    https://spark.apache.org/docs/latest/configuration.html

    I have written a blog post that describes the steps for debugging these sub-systems; hopefully it helps others.

    Blog post outlining the steps:

    https://sandeepmspark.blogspot.com/2018/01/spark-standalone-cluster-internals.html
