Can PySpark work without Spark?

Submitted by 自作多情 on 2020-07-06 08:56:13

Question


I have installed PySpark standalone/locally (on Windows) using

pip install pyspark

I was a bit surprised that I can already run pyspark from the command line or use it in Jupyter notebooks, and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c ).
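For reference, this is the kind of minimal sketch I mean by "it already runs" — it works in a plain Python or Jupyter session right after the pip install (assuming a JDK is available on the machine; the app name is just a placeholder):

    import pyspark

    # Local mode only: no cluster, no separate Spark download.
    sc = pyspark.SparkContext(master="local[*]", appName="pip-only-test")
    print(sc.parallelize(range(10)).sum())  # 45
    sc.stop()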

Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain:

  • what is the exact connection between these two technologies?
  • why is installing PySpark enough to make it run? Does it actually install Spark under the hood? If yes, where?
  • if you install only PySpark, is there something you miss? (e.g. I cannot find the sbin folder, which contains, among other things, the script to start the history server)

Answer 1:


As of v2.2, executing pip install pyspark will install Spark.

If you're going to use PySpark, it's clearly the simplest way to get started.

On my system, Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars.
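A quick way to confirm this on your own machine is to ask the pyspark package where it lives. This is only a sketch (the exact paths will differ per environment), but the jars folder it lists is the bundled Spark runtime:

    import os
    import pyspark

    # Directory of the pip-installed pyspark package ...
    pyspark_home = os.path.dirname(pyspark.__file__)
    print(pyspark_home)  # e.g. .../site-packages/pyspark

    # ... and the bundled Spark/Scala jars that make it a working local Spark install
    print(os.listdir(os.path.join(pyspark_home, "jars"))[:5])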




Answer 2:


The PySpark package installed by pip is a subset of the full Spark distribution. You can find most of the PySpark Python files in spark-3.0.0-bin-hadoop3.2/python/pyspark. So if you want to use the Java or Scala interfaces, or deploy a distributed system with Hadoop, you must download the full Spark distribution from the Apache Spark site and install it.
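As a rough illustration of the split: the pip-installed package alone is enough for local mode (a JDK still has to be present), while anything that needs the sbin scripts, the Scala/Java APIs, or a real cluster deployment requires the full distribution. A minimal local-mode sketch, assuming only pip install pyspark:

    from pyspark.sql import SparkSession

    # Runs entirely on the pip-installed package: no cluster manager,
    # no separate Spark download, just local threads via "local[*]".
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("pip-only-local-mode")
        .getOrCreate()
    )

    spark.range(5).show()  # a tiny DataFrame with ids 0..4
    spark.stop()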



Source: https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark
