Run PySpark and Kafka in a Jupyter Notebook


As @user6910411 said, PYSPARK_SUBMIT_ARGS only takes effect before your SparkContext is instantiated.

In the example you followed, they probably use a plain Python kernel for their Jupyter notebook and instantiate the SparkContext themselves through the pyspark library.
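For comparison, here is a minimal sketch of what that looks like in a plain Python kernel, where you set PYSPARK_SUBMIT_ARGS yourself before creating the session (the package coordinate mirrors the one used in the kernel.json below and may need adjusting to your Spark/Scala versions):

import os

# Must be set before any SparkContext/SparkSession exists,
# otherwise the --packages argument is silently ignored.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("StructuredKafkaWordCount")\
    .getOrCreate()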

I'm guessing you're using a PySpark kernel, hence:

spark = SparkSession\
    .builder\
    .appName("StructuredKafkaWordCount")\
    .getOrCreate()

won't start a new SparkSession but will only fetch the one that already exists (created at kernel startup, before your PYSPARK_SUBMIT_ARGS could take effect).

You can pass arguments to the spark-submit run by Jupyter in your kernel.json file, so the libraries get loaded every time you start a new notebook:

{
    "display_name": "PySpark",
    "language": "python",
    "argv": [ "/opt/anaconda3/bin/python", "-m", "ipykernel", "-f", "  {connection_file}" ],
    "env": {
        "SPARK_HOME": "/usr/iop/current/spark-client",
        "PYSPARK_PYTHON": "/opt/anaconda3/bin/python3",
        "PYTHONPATH": "/usr/iop/current/spark-client/python/:/usr/iop/current/spark-client/python/lib/py4j-0.9-src.zip",
        "PYTHONSTARTUP": "/usr/iop/current/spark-client/python/pyspark/shell.py",
        "PYSPARK_SUBMIT_ARGS":  "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"
  }
}
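Once you restart the kernel with this configuration, the Kafka source should be available right away. A quick sanity check, assuming a local broker and a hypothetical topic name (replace both with your own setup):

df = spark\
    .readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", "localhost:9092")\
    .option("subscribe", "mytopic")\
    .load()

# If the package was not loaded, .load() raises
# "Failed to find data source: kafka".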