Link Spark with IPython Notebook


I have Jupyter installed, and indeed it is simpler than you think:

  1. Install Anaconda for OS X.
  2. Install Jupyter by typing the next line in your terminal:

    ilovejobs@mymac:~$ conda install jupyter
    
  3. Update jupyter just in case.

    ilovejobs@mymac:~$ conda update jupyter
    
  4. Download Apache Spark and compile it, or download the prebuilt Apache Spark 1.5.1 + Hadoop 2.6 package and uncompress it:

    ilovejobs@mymac:~$ cd Downloads
    ilovejobs@mymac:~/Downloads$ wget http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
    ilovejobs@mymac:~/Downloads$ tar -xzf spark-1.5.1-bin-hadoop2.6.tgz
    
  5. Create an Apps folder in your home directory, e.g.:

    ilovejobs@mymac:~/Downloads$ mkdir ~/Apps
    
  6. Move the uncompressed folder to the ~/Apps directory (renaming it here to the shorter spark-1.5.1 used in the following steps):

    ilovejobs@mymac:~/Downloads$ mv spark-1.5.1-bin-hadoop2.6/ ~/Apps/spark-1.5.1
    
  7. Move to the ~/Apps directory and verify that spark is there.

    ilovejobs@mymac:~/Downloads$ cd ~/Apps
    ilovejobs@mymac:~/Apps$ ls -l
    drwxr-xr-x ?? ilovejobs ilovejobs 4096 ?? ?? ??:?? spark-1.5.1
    
  8. Here is the first tricky part. Add the spark binaries to your $PATH:

    ilovejobs@mymac:~/Apps$ cd
    ilovejobs@mymac:~$ echo 'export PATH=$HOME/Apps/spark-1.5.1/bin:$PATH' >> .profile
    
  9. Here is the second tricky part. Add these environment variables as well:

    ilovejobs@mymac:~$ echo "export PYSPARK_DRIVER_PYTHON=ipython" >> .profile
    ilovejobs@mymac:~$ echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook'" >> .profile
    
  10. Source the profile to make these variables available to this terminal:

    ilovejobs@mymac:~$ source .profile
    
  11. Create a ~/notebooks directory.

    ilovejobs@mymac:~$ mkdir notebooks
    
  12. Move to ~/notebooks and run pyspark:

    ilovejobs@mymac:~$ cd notebooks
    ilovejobs@mymac:~/notebooks$ pyspark
    

Notice that you can also add those variables to the .bashrc located in your home directory. Now be happy: you should be able to run Jupyter with a pyspark kernel (it will show up as Python 2, but it will use Spark).
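
To verify the setup, here is a minimal sanity check (a sketch under the assumptions above: the notebook was launched through pyspark, so a SparkContext named sc is already defined) that can go in the first notebook cell:

    # When the notebook is started via `pyspark`, the SparkContext is
    # already created and bound to `sc`; no extra setup is needed.
    print(sc.version)                 # should print 1.5.1 for this download

    # A tiny job to confirm that the executors actually run work.
    rdd = sc.parallelize(range(100))
    print(rdd.sum())                  # expected output: 4950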

First, make sure you have a working Spark environment on your machine.

Then, install the Python module findspark via pip:

    $ sudo pip install findspark

And then in the python shell:

    import findspark
    findspark.init()  # locate the local Spark installation and add pyspark to sys.path

    import pyspark
    sc = pyspark.SparkContext(appName="myAppName")  # create a SparkContext as usual

Now you can do whatever you want with pyspark in the Python shell (or in IPython).
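
As a quick illustration (a minimal sketch; the appName above and the sample data below are arbitrary), the resulting sc behaves like any other SparkContext:

    # Continuing from the snippet above: findspark.init() has been called
    # and `sc` is a live SparkContext.
    words = sc.parallelize(["spark", "jupyter", "spark", "notebook"])
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.collect())  # e.g. [('jupyter', 1), ('notebook', 1), ('spark', 2)]

    sc.stop()                # release the context when you are done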

In my view, this is actually the easiest way to use a Spark kernel in Jupyter.

FYI, you can now run Scala, PySpark, SparkR, and SQL with Spark running on top of Jupyter via https://github.com/ibm-et/spark-kernel. The new interpreters were added (and marked experimental) in pull request https://github.com/ibm-et/spark-kernel/pull/146.

See the language support wiki page for more information.

Spark with IPython/Jupyter notebook is great, and I'm pleased that Alberto was able to help you get it working.

For reference, it's also worth considering two good alternatives that come prepackaged and can easily be integrated into a YARN cluster (if desired).

Spark Notebook: https://github.com/andypetrella/spark-notebook

Apache Zeppelin: https://zeppelin.incubator.apache.org/

At the time of writing, Spark Notebook (v0.6.1) is more mature, and you can pre-build an install against your Spark and Hadoop version here: http://spark-notebook.io/

Zeppelin (v0.5) looks very promising but doesn't offer as much functionality as Spark Notebook or IPython with Spark right now.
