Piping Scala RDD to Python code fails


Question


I am trying to execute Python code inside a Scala program, passing an RDD as data to the Python script. The Spark cluster initializes successfully, the conversion of the data to an RDD works, and running the Python script on its own (outside the Scala code) works. However, executing the same Python script from inside the Scala code fails with:

java.lang.IllegalStateException: Subprocess exited with status 2. Command ran: /{filePath}/{File}.py

Looking deeper, the log shows import: command not found when the Python file is executed. I believe the subprocess launched from Scala does not recognize the file as a Python script, so the shell tries to run the Python statements (such as import) as shell commands.

Note: my environment variables are set correctly, and I can access and execute Python scripts from everywhere else. In place of filePath and File I have the actual path and file name.

Environment: Spark 2.2.1, Scala 2.11.11, Python 2.7.10

Code:

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sparkContext = new SparkContext(conf)

// Ship the Python script to every executor
val distScript = "/{filePath}/{File}.py"
val distScriptName = "{File}.py"
sparkContext.addFile(distScript)

val ipData = sparkContext.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))

// Pipe each element through the script via stdin/stdout
val pipeRDD = ipData.pipe(SparkFiles.get(distScriptName))

pipeRDD.foreach(println)

Has anyone tried this before and can help resolve this issue? Is this the best way to integrate Scala and Python scripts? I am open to other verified recommendations and suggestions.


Answer 1:


I found where the issue came from. I was missing the following shebang line at the top of the Python file:

#!/usr/bin/python

After adding it, the java.lang.IllegalStateException disappeared.
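For context, a minimal sketch of what such a piped script might look like. RDD.pipe sends each element as one line on the script's stdin and turns each line the script writes to stdout into an element of the resulting RDD. The uppercasing logic and the process function name here are hypothetical placeholders, not from the original question:

```python
#!/usr/bin/python
# Example script for RDD.pipe: read one element per line from stdin,
# write one transformed element per line to stdout.
import sys

def process(line):
    # Placeholder transformation: strip the newline and uppercase
    return line.strip().upper()

if __name__ == "__main__":
    for line in sys.stdin:
        print(process(line))
```

Note that because pipe executes the file directly rather than through a Python interpreter, the script generally also needs execute permission (chmod +x) in addition to the shebang line.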



来源:https://stackoverflow.com/questions/48832630/piping-scala-rdd-to-python-code-fails
