I'm trying to use Livy to remotely submit several Spark jobs. Let's say I want to perform the following spark-submit task remotely (with all the options as such):
spark-submit \
--class com.company.drivers.JumboBatchPipelineDriver \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.serializer='org.apache.spark.serializer.KryoSerializer' \
--conf "spark.executor.extraJavaOptions= -XX:+UseG1GC" \
--master yarn \
--deploy-mode cluster \
/home/hadoop/y2k-shubham/jars/jumbo-batch.jar \
\
--start=2012-12-21 \
--end=2012-12-21 \
--pipeline=db-importer \
--run-spiders
NOTE: The options after the JAR (--start, --end etc.) are specific to my Spark application; I'm using scopt to parse them.
I'm aware that I can supply all the various options in the above spark-submit command using Livy's POST /batches request. But since I have to make over 250 spark-submits remotely, I'd like to exploit Livy's session-management capabilities; i.e., I want Livy to create a SparkSession once and then use it for all my spark-submit requests. The POST /sessions request allows me to specify quite a few options for instantiating a SparkSession remotely. However, I see no session argument in the POST /batches request.
How can I make use of the SparkSession that I created using POST/sessions request for submitting my Spark job using POST/batches request?
I've referred to the following examples, but they only demonstrate supplying (Python) code for the Spark job within Livy's POST request.
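For reference, here's a sketch of what the POST /batches body equivalent to the spark-submit command above would look like. The field names follow Livy's batch REST API as I understand it (not verified against a live server), and note that master / deploy-mode are normally fixed server-side via livy.spark.master and livy.spark.deploy-mode rather than passed per-request:

```python
import json

# Sketch of a POST /batches body mirroring the spark-submit command above.
# Field names are from Livy's batch REST API; paths are from my command.
payload = {
    "file": "/home/hadoop/y2k-shubham/jars/jumbo-batch.jar",
    "className": "com.company.drivers.JumboBatchPipelineDriver",
    "driverCores": 1,
    "driverMemory": "1g",
    "conf": {
        "spark.dynamicAllocation.enabled": "true",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    },
    # application-specific arguments (everything after the JAR path)
    "args": [
        "--start=2012-12-21",
        "--end=2012-12-21",
        "--pipeline=db-importer",
        "--run-spiders",
    ],
}

# send as e.g. POST http://<livy-host>:8998/batches with a JSON content type
body = json.dumps(payload)
```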
How can I make use of the SparkSession that I created using POST /sessions request for submitting my Spark job using POST /batches request?
- At this stage, I'm all but certain that this is not possible right now.
- @Luqman Ghani's comment gives a fairly good hint that batch-mode is intended for a different use-case than session-mode / LivyClient.
The reason I've identified why this isn't possible is as follows (please correct me if I'm wrong / incomplete):
- POST /batches request accepts a JAR.
- This inhibits the SparkSession (or spark-shell) from being re-used (without restarting the SparkSession), because:
  - How would you remove the JAR from the previous POST /batches request?
  - How would you add the JAR from the current POST /batches request?
And here's a more complete picture:

- Actually, POST /sessions allows you to pass a JAR,
- but then further interactions with that session (obviously) cannot take JARs;
- they (further interactions) can only be simple scripts (like PySpark: simple Python files) that can be loaded into the session (and not JARs).
Possible workaround
- All those who have their Spark application written in Scala/Java, which must be bundled in a JAR, will face this difficulty; Python (PySpark) users are lucky here.
- As a possible workaround, you can try this (I see no reason why it wouldn't work):
  - launch a session with your JAR via a POST /sessions request,
  - then invoke the entrypoint class from your JAR via Python (submit POST /sessions/{sessionId}/statements) as many times as you want (with possibly different parameters). While this wouldn't be straightforward, it sounds very much possible.
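The workaround above can be sketched as follows. This is hypothetical: the host/port are assumptions, and the py4j gateway call (reaching the JVM entrypoint class from the session's predefined SparkContext, which I believe Livy exposes as sc) has not been tested against a running Livy session:

```python
import json

LIVY_URL = "http://livy-host:8998"  # assumption: your Livy endpoint

# 1. Launch one long-lived session carrying the JAR (POST /sessions).
session_body = json.dumps({
    "kind": "pyspark",
    "jars": ["/home/hadoop/y2k-shubham/jars/jumbo-batch.jar"],
    "driverCores": 1,
    "driverMemory": "1g",
    "conf": {"spark.dynamicAllocation.enabled": "true"},
})

def statement_for(start, end, pipeline):
    """Build the body for POST /sessions/{sessionId}/statements: a small
    PySpark snippet that reaches the JVM entrypoint class via the py4j
    gateway and passes the scopt-style arguments as a Java String[]."""
    args = ["--start=" + start, "--end=" + end,
            "--pipeline=" + pipeline, "--run-spiders"]
    code = "\n".join(
        ["gw = sc._gateway  # 'sc' is predefined in a Livy pyspark session",
         "jargs = gw.new_array(gw.jvm.java.lang.String, %d)" % len(args)]
        + ["jargs[%d] = %r" % (i, a) for i, a in enumerate(args)]
        + ["gw.jvm.com.company.drivers.JumboBatchPipelineDriver.main(jargs)"]
    )
    return json.dumps({"code": code})

# 2. POST statement_for(...) against the same sessionId as many times as
#    you want, with different parameters each time.
```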
Finally, I found some more alternatives to Livy for remote spark-submit; see this.
Source: https://stackoverflow.com/questions/51746286/use-existing-sparksession-in-post-batches-request