Question
I am working on an Ubuntu server that has Spark installed on it. I don't have sudo access to this server, so under my own directory I created a new virtual environment and installed pyspark in it.
When I type the command below, I get the output shown underneath it:
whereis spark-shell
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell /home/abcd/.pyenv/shims/spark-shell2.cmd /home/abcd/.pyenv/shims/spark-shell.cmd /home/abcd/.pyenv/shims/spark-shell
Another command:
echo 'sc.getConf.get("spark.home")' | spark-shell
scala> sc.getConf.get("spark.home")
res0: String = /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark
q1) Am I using the right commands to find the installation directory of Spark?
q2) Can you help me understand why I see three /opt paths and three pyenv paths?
Answer 1:
A Spark installation (like the one you have in /opt/spark-2.4.4-bin-hadoop2.7) typically comes with a pyspark installation inside it. You can check this by downloading and extracting this tarball (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz).
If you install pyspark in a virtual environment, you're installing another instance of pyspark, one that comes without the Scala source code but does include the compiled Spark code as jars (see the jars folder in your pyspark installation). pyspark is a wrapper over Spark (which is written in Scala). This is probably what you're seeing in /home/abcd/.pyenv/shims/.
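If you want to see those jars for yourself, here is a minimal sketch (assuming you run it from inside the virtual environment; only the pyspark package and its jars folder come from the pip install, the rest is illustrative):
# Python code: locate the pip-installed pyspark package and list a few of its bundled jars
import os
import pyspark

pkg_dir = os.path.dirname(pyspark.__file__)                    # e.g. .../site-packages/pyspark
print(pkg_dir)
print(sorted(os.listdir(os.path.join(pkg_dir, "jars")))[:5])   # compiled Spark code shipped as jars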
The scripts spark-shell2.cmd and spark-shell.cmd in the same directory are part of the same Spark installation. They are text files, and you can cat them; you will see that spark-shell.cmd calls spark-shell2.cmd internally. You will probably find many more scripts in your /opt/spark-2.4.4-bin-hadoop2.7/bin/ folder, all of which are part of the same Spark installation. The same goes for the folder /home/abcd/.pyenv/shims/. Finally, /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark looks like yet another pyspark installation.
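To see which of these your shell and your Python interpreter actually pick up, you can compare the resolved executable with the imported package. A small sketch, assuming you run it from inside the bio virtual environment:
# Python code: compare what the shell resolves with what Python imports
import shutil
import pyspark

print(shutil.which("spark-shell"))   # likely the pyenv shim, given the whereis output above
print(pyspark.__file__)              # the site-packages pyspark installation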
It shouldn't matter which pyspark installation you use. In order to use Spark, a Java process needs to be created that runs the Scala/Java code (from the jars in your installation).
Typically, when you run a command like this
# Python code
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myappname').getOrCreate()
you create a new Java process that runs Spark.
If you run the script /opt/spark-2.4.4-bin-hadoop2.7/bin/pyspark, you also create a new Java process.
You can check whether such a Java process is indeed running with something like this: ps aux | grep "java".
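As a concrete illustration, here is a hedged Python counterpart of the Scala check from the question (a sketch, assuming the virtualenv's pyspark is the one on sys.path; the app name "which-spark" is just illustrative):
# Python code: creating a SparkSession starts a Java process behind the scenes
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("which-spark").getOrCreate()
# "spark.home" is not set in every launch mode, so pass a default instead of failing
print(spark.sparkContext.getConf().get("spark.home", "spark.home not set"))
spark.stop()
While this session is alive, running ps aux | grep "java" in another shell should show the corresponding Java process.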
Source: https://stackoverflow.com/questions/62537715/why-do-i-see-multiple-spark-installations-directories