How can I debug a Spark application locally?

那年仲夏, posted on 2019-11-28 04:07:03

As David Griffin mentioned, using spark-shell can be very helpful. However, I find that doing actual local debugging (setting breakpoints, inspecting variables, etc.) is indispensable. Here's how I do it using IntelliJ.

First, make sure you can run your Spark application locally using spark-submit, e.g. something like:

spark-submit --name MyApp --class MyMainClass --master local[2] myapplication.jar

Then, tell your local Spark driver to pause and wait for a connection from a debugger when it starts up, by adding an option like the following:

--conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

where -agentlib:jdwp loads the Java Debug Wire Protocol (JDWP) agent, followed by a comma-separated list of sub-options:

  • transport: the connection protocol used between debugger and debuggee, either socket (dt_socket) or shared memory (dt_shmem). You almost always want dt_socket; shared memory is only available on Microsoft Windows.
  • server: whether this process should be the server when talking to the debugger (or, conversely, the client). You always need one server and one client. In this case, we're going to be the server and wait for a connection from the debugger.
  • suspend: whether to pause execution until a debugger has successfully connected. We turn this on so the driver won't start until the debugger connects.
  • address: here, this is the port to listen on for incoming debugger connection requests. You can set it to any available port; just make sure the debugger is configured to connect to this same port.
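Since the option string is easy to mistype, here is a minimal sketch of a shell helper that assembles it from the sub-options described above. The function name `jdwp_agent_opts` and its defaults are made up for this example and are not part of Spark or the JDK.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): build the JDWP agent option.
#   $1 = port to listen on (default 5005)
#   $2 = suspend flag, y or n (default y)
jdwp_agent_opts() {
  port="${1:-5005}"
  suspend="${2:-y}"
  # Note the leading "-": without it the JVM does not treat this as an option.
  printf '%s\n' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=${suspend},address=${port}"
}

# Print the resulting --conf argument so it can be inspected before use.
echo "--conf spark.driver.extraJavaOptions=$(jdwp_agent_opts 5005 y)"
```

Passing `n` as the second argument gives the non-blocking variant, which is handy when you only want to attach occasionally.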

So now, your spark-submit command line should look something like:

spark-submit --name MyApp --class MyMainClass --master local[2] --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 myapplication.jar

Now if you run the above, you should see something like

Listening for transport dt_socket at address: 5005

and your Spark application is waiting for the debugger to attach.

Next, open the IntelliJ project containing your Spark application, and then open "Run -> Edit Configurations..." Then click the "+" to add a new run/debug configuration, and select "Remote". Give it a name, e.g. "SparkLocal", and select "Socket" for Transport, "Attach" for Debugger mode, and type in "localhost" for Host and the port you used above for Port, in this case, "5005". Click "OK" to save.

In my version of IntelliJ it gives you suggestions for the debug command line to use for the debugged process, and it uses "suspend=n" -- we're ignoring that and using "suspend=y" (as above) because we want the application to wait until we connect to start.

Now you should be ready to debug. Simply start Spark with the above command, then select the IntelliJ run configuration you just created and click Debug. IntelliJ should connect to your Spark application, which should now start running. You can set breakpoints, inspect variables, etc.

Fire up the Spark shell. This is straight from the Spark documentation:

./bin/spark-shell --master local[2]

You will also see the Spark shell referred to as the REPL. It is by far the best way to learn Spark. I spend 80% of my time in the Spark shell and the other 20% translating the code into my application.

Just pass Java options to open a debug port. Here is a nice article addressing your question: http://danosipov.com/?p=779. I'm using it like this:

$ SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 spark-shell

(yes, SPARK_JAVA_OPTS is deprecated, but it works fine)

@Jason Evans's answer did not work for me, but the following did:

--conf spark.driver.extraJavaOptions=-Xrunjdwp:transport=dt_socket,server=y,address=8086,suspend=n

Only one minor change is needed for @Jason Evans's answer: it needs a "-" before the string "agentlib...".

--conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

You might also use the option "--driver-java-options" to achieve the same purpose:

--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

First pick a version of Spark, then pick an IDE; IntelliJ would be the better choice. Check out the source code of that Spark version and make sure you can build it successfully from the IDE (more here). Once you have a clean build, look for the test cases or test suites; "SubquerySuite", for example, is a simple one. Then debug it like a normal application. Comment if you need help with any specific steps.

Ingram

You can try this in spark-env.sh:

SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8888
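Because spark-env.sh is read whenever the Spark launch scripts run, an unconditional suspend=y would make every submission wait for a debugger. One way around that, sketched here under the assumption that you are free to introduce your own environment variable, is to set the option only on request. `SPARK_DEBUG_PORT` and `enable_debug_if_requested` are hypothetical names chosen for this example; Spark does not define them.

```shell
# Sketch for spark-env.sh. SPARK_DEBUG_PORT is a hypothetical variable name
# chosen for this example; Spark itself does not define or read it.
enable_debug_if_requested() {
  if [ -n "${SPARK_DEBUG_PORT:-}" ]; then
    # Same agent option as above, with the port taken from the environment.
    SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=${SPARK_DEBUG_PORT}"
    export SPARK_SUBMIT_OPTS
  fi
}

enable_debug_if_requested
```

With this in place, a normal spark-submit runs as usual, while prefixing the command with SPARK_DEBUG_PORT=8888 makes that one run wait for the debugger on port 8888.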

Here is how to get everything on the console:

First check here to see what level of info you want spark (log4j) to print on your console:

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-logging.html

Then submit your command as follows:

path/to/spark-submit \
  --master local[a number of cores here from your CPU] \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j-driver.properties -Dvm.logging.level=ALL"

"ALL" gives you all the information available. It also should not matter if Spark does not find your log4j.properties file: it should still pick up your desired logging level and print the information to your screen.
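For reference, here is a rough sketch of what a log4j-driver.properties file for the command above might contain. It assumes a Spark release that still uses log4j 1.x (newer releases ship log4j 2, which uses a different file format) and that log4j 1.x substitutes the `vm.logging.level` system property into the file. The appender lines follow the console setup in Spark's own log4j.properties template; the helper name `write_log4j_driver_props` and the output path are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: write a log4j 1.x properties file to the path in $1.
write_log4j_driver_props() {
  # The quoted 'EOF' keeps the shell from expanding ${vm.logging.level};
  # log4j is assumed to resolve it from -Dvm.logging.level at startup.
  cat > "$1" <<'EOF'
log4j.rootCategory=${vm.logging.level}, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF
}

# Example location only; point -Dlog4j.configuration=file:... at the same path.
write_log4j_driver_props "${TMPDIR:-/tmp}/log4j-driver.properties"
```

Keeping the level in a system property rather than in the file means the same file can be reused with -Dvm.logging.level=INFO or DEBUG for quieter runs.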
