Amazon EMR Pyspark Module not found

故里飘歌 2021-02-20 04:32

I created an Amazon EMR cluster with Spark preinstalled. When I SSH into the cluster and run pyspark from the terminal, the PySpark shell starts.

I uploaded a

  • 2021-02-20 05:17

    I added the following lines to ~/.bashrc on EMR 4.3:

    export SPARK_HOME=/usr/lib/spark
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-XXX-src.zip:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

    Here py4j-XXX-src.zip is the py4j file in your Spark Python library folder. Search /usr/lib/spark/python/lib/ to find the exact version and replace the XXX with that version number.

    Run source ~/.bashrc and you should be good.

  • 2021-02-20 05:30

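    Try using findspark. Install it from the shell with pip install findspark.

    Sample code:

    # Import package(s).
    import findspark
    # Locate the Spark installation and add it to sys.path.
    # This must run before any pyspark import.
    findspark.init()

    from pyspark import SparkContext
    from pyspark.sql import SQLContext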
  • 2021-02-20 05:33

    You probably need to add the pyspark files to the path. I typically use a function like the following.

    import os
    import sys

    def configure_spark(spark_home=None, pyspark_python=None):
        spark_home = spark_home or "/path/to/default/spark/home"
        os.environ['SPARK_HOME'] = spark_home
        # Add the PySpark directories to the Python path:
        sys.path.insert(1, os.path.join(spark_home, 'python'))
        sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
        sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))
        # If pyspark_python isn't specified, use the currently running Python binary:
        pyspark_python = pyspark_python or sys.executable
        os.environ['PYSPARK_PYTHON'] = pyspark_python

    Then call the function before importing pyspark:

    configure_spark('/path/to/spark/home')  # placeholder; pass your actual Spark home
    from pyspark import SparkContext

    Spark home on an EMR node should be something like /home/hadoop/spark.
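
One way to confirm that a path change like the one in this last answer took effect is to ask Python's import machinery whether it can locate the module, without actually importing it. The sketch below is illustrative and not part of any answer: module_visible is a hypothetical helper name, and the /home/hadoop/spark path in the usage comment is only the example location mentioned above.

```python
import importlib.util
import sys


def module_visible(name, extra_paths=()):
    """Return True if top-level module `name` can be located with `extra_paths` added.

    sys.path is restored afterwards, so this only probes; it does not configure.
    """
    saved = list(sys.path)
    try:
        for path in extra_paths:
            sys.path.insert(1, path)
        importlib.invalidate_caches()  # forget cached directory listings
        return importlib.util.find_spec(name) is not None
    finally:
        sys.path[:] = saved


# Example (the path is an assumption; use your cluster's Spark home):
# module_visible("pyspark", ["/home/hadoop/spark/python"])
```

If the probe returns False for the directory you expect, the Spark home is probably wrong for that EMR release, which is a cheaper thing to discover than a failed import in the middle of a job script.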

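
The PYTHONPATH entries from the first answer can also be computed rather than typed by hand, which avoids looking up the py4j version manually. This is a minimal sketch, assuming the py4j archive follows the py4j-*-src.zip naming under the python/lib folder of the Spark home; spark_python_paths is a hypothetical helper name, not part of Spark or EMR.

```python
import glob
import os


def spark_python_paths(spark_home="/usr/lib/spark"):
    """Return the entries PySpark needs on the Python path.

    The py4j zip is found by pattern instead of hard-coding its version.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not py4j_zips:
        raise FileNotFoundError("no py4j-*-src.zip under " + lib_dir)
    # If several archives exist, take the last one in sorted order.
    return [os.path.join(spark_home, "python"), py4j_zips[-1]]
```

The returned entries can be joined with os.pathsep to build a PYTHONPATH value, or inserted into sys.path before importing pyspark.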