Question
I am working on an Ubuntu server that has Spark installed on it. I don't have sudo access to this server, so under my own directory I created a new virtual environment and installed pyspark in it.
When I type the command below, I get the output shown underneath it:
whereis spark-shell
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell /home/abcd/.pyenv/shims/spark-shell2.cmd /home/abcd/.pyenv/shims/spark-shell.cmd /home/abcd/.pyenv/shims/spark-shell
Another command:
echo 'sc.getConf.get("spark.home")' | spark-shell
scala> sc.getConf.get("spark.home")
res0: String = /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark
q1) Am I using the right commands to find the installation directory of Spark?
q2) Can you help me understand why I see three /opt paths and three pyenv paths?
Answer 1:
A Spark installation (like the one you have in /opt/spark-2.4.4-bin-hadoop2.7) typically comes with a pyspark installation inside it. You can check this by downloading and extracting this tarball (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz).
If you install pyspark in a virtual environment, you're installing another instance of pyspark, one that comes without the Scala source code but does include the compiled Spark code as jars (see the jars folder in your pyspark installation). pyspark is a wrapper over Spark (which is written in Scala). This is probably what you're seeing in /home/abcd/.pyenv/shims/.
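If you want to see those jars for yourself, here is a minimal sketch (assuming you run it from inside the virtual environment; only the pyspark package and its jars folder come from the pip install, the rest is illustrative):
# Python code: locate the pip-installed pyspark package and list a few of its bundled jars
import os
import pyspark

pkg_dir = os.path.dirname(pyspark.__file__)                    # e.g. .../site-packages/pyspark
print(pkg_dir)
print(sorted(os.listdir(os.path.join(pkg_dir, "jars")))[:5])   # compiled Spark code shipped as jars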
The scripts spark-shell2.cmd and spark-shell.cmd in the same directory are part of the same Spark installation. They are text files, and you can cat them; you will see that spark-shell.cmd calls spark-shell2.cmd internally. You will probably find many more scripts in your /opt/spark-2.4.4-bin-hadoop2.7/bin/ folder, all of which are part of the same Spark installation. The same goes for the folder /home/abcd/.pyenv/shims/. Finally, /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark looks like yet another pyspark installation.
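To see which of these your shell and your Python interpreter actually pick up, you can compare the resolved executable with the imported package. A small sketch, assuming you run it from inside the bio virtual environment:
# Python code: compare what the shell resolves with what Python imports
import shutil
import pyspark

print(shutil.which("spark-shell"))   # likely the pyenv shim, given the whereis output above
print(pyspark.__file__)              # the site-packages pyspark installation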
It shouldn't matter which pyspark installation you use. In order to use Spark, a Java process needs to be created that runs the Scala/Java code (from the jars in your installation).
Typically, when you run a command like this
# Python code
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myappname').getOrCreate()
you create a new Java process that runs Spark.
If you run the script /opt/spark-2.4.4-bin-hadoop2.7/bin/pyspark, you also create a new Java process.
You can check whether such a Java process is indeed running with something like this: ps aux | grep "java".
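As a concrete illustration, here is a hedged Python counterpart of the Scala check from the question (a sketch, assuming the virtualenv's pyspark is the one on sys.path; the app name "which-spark" is just illustrative):
# Python code: creating a SparkSession starts a Java process behind the scenes
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("which-spark").getOrCreate()
# "spark.home" is not set in every launch mode, so pass a default instead of failing
print(spark.sparkContext.getConf().get("spark.home", "spark.home not set"))
spark.stop()
While this session is alive, running ps aux | grep "java" in another shell should show the corresponding Java process.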
Source: https://stackoverflow.com/questions/62537715/why-do-i-see-multiple-spark-installations-directories