Why do I see multiple Spark installation directories?

Submitted by 北城余情 on 2020-07-10 10:27:08

Question


I am working on an Ubuntu server that has Spark installed on it.

I don't have sudo access to this server.

So under my directory, I created a new virtual environment in which I installed pyspark.

When I type the command below:

whereis spark-shell   #see below


/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell /home/abcd/.pyenv/shims/spark-shell2.cmd /home/abcd/.pyenv/shims/spark-shell.cmd /home/abcd/.pyenv/shims/spark-shell

Another command:

echo 'sc.getConf.get("spark.home")' | spark-shell

scala> sc.getConf.get("spark.home")
res0: String = /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark

Q1) Am I using the right commands to find the installation directory of Spark?

Q2) Can someone help me understand why I see 3 /opt paths and 3 pyenv paths?


Answer 1:


A spark installation (like the one you have in /opt/spark-2.4.4-bin-hadoop2.7) typically comes with a pyspark installation within it. You can check this by downloading and extracting this tarball (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz).

If you install pyspark in a virtual environment, you're installing another instance of pyspark, which comes without the Scala source code but does ship the compiled spark code as jars (see the jars folder in your pyspark installation). pyspark is a wrapper over spark (which is written in Scala). This is probably what you're seeing in /home/abcd/.pyenv/shims/.
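
If you want to confirm this yourself, here is a minimal sketch (run with the Python from your virtual environment) that prints the pyspark package Python actually imports and looks for its bundled jars. Note that a pip-installed pyspark keeps its jars inside the package, while a full Spark distribution keeps them under SPARK_HOME/jars instead.

# Sketch: locate the pyspark package this Python resolves and its jars
import os
import pyspark

pkg_dir = os.path.dirname(pyspark.__file__)
print("pyspark package:", pkg_dir)

jars_dir = os.path.join(pkg_dir, "jars")
if os.path.isdir(jars_dir):
    print("bundled jars:", len(os.listdir(jars_dir)), "files in", jars_dir)
else:
    print("no jars folder inside the package; check $SPARK_HOME/jars instead")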

The scripts spark-shell2.cmd and spark-shell.cmd in the same directory are part of the same spark installation. These are text files and you can cat them. You will see that spark-shell.cmd calls spark-shell2.cmd within it. You will probably have many more scripts in your /opt/spark-2.4.4-bin-hadoop2.7/bin/ folder, all of which are part of the same spark installation. The same goes for the folder /home/abcd/.pyenv/shims/. Finally, /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark seems to be yet another pyspark installation.

It shouldn't matter which pyspark installation you use. In order to use spark, a java process needs to be created that runs the Scala/Java code (from the jars in your installation).

Typically, when you run a command like this

# Python code
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myappname').getOrCreate()

then you create a new java process that runs spark.
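
As a small sanity check (a sketch, assuming pyspark is importable in the environment you launch Python from), you can compare the version of the Python wrapper you imported against the version reported by the java process it starts:

# Sketch: the Python wrapper and the JVM it launches should report the same version
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myappname').getOrCreate()
print("wrapper:", pyspark.__version__, "| JVM:", spark.version)
spark.stop()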

If you run the script /opt/spark-2.4.4-bin-hadoop2.7/bin/pyspark then you also create a new java process.

You can check if there is indeed such a java process running using something like this: ps aux | grep "java".
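
If you prefer to do that check from Python (a sketch, assuming a POSIX system where ps is available), the command line of the java process usually contains the path of the installation that launched it, which tells you which of the directories above is actually in use:

# Sketch: list running java processes whose command line mentions spark
import subprocess

out = subprocess.run(["ps", "-eo", "pid,args"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "java" in line and "spark" in line.lower():
        print(line)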



Source: https://stackoverflow.com/questions/62537715/why-do-i-see-multiple-spark-installations-directories
