How to set Jupyter notebook to Python3 instead of Python2.7 in AWS EMR


Question


I am spinning up an EMR cluster in AWS. The difficulty arises when trying to import the associated Python modules from Jupyter. I have a shell script that runs when the EMR cluster starts and installs the Python modules.

The notebook is set to run using the PySpark Kernel.

I believe the problem is that the Jupyter notebook is not pointed to the correct Python in EMR. The methods I have used to set the notebook to the correct version do not seem to work.

I have set the following configurations. I have tried changing python to python3.6 and python3.

Configurations=[{
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [{
        "Classification": "export",
        "Properties": {
            "PYSPARK_PYTHON": "python",
            "PYSPARK_DRIVER_PYTHON": "python",
            "SPARK_YARN_USER_ENV": "python"
        }
    }]
}]
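
For comparison, a sketch of the same configuration pointed explicitly at the cluster's Python 3 interpreter is shown below. Here /usr/bin/python3 is an assumption about where Python 3 lives on the EMR AMI, and (as the answer below notes) this cluster-level setting is overridden by the EMR notebook's own configuration:

Configurations=[{
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [{
        "Classification": "export",
        "Properties": {
            # /usr/bin/python3 is the usual location on EMR AMIs (assumption)
            "PYSPARK_PYTHON": "/usr/bin/python3",
            "PYSPARK_DRIVER_PYTHON": "/usr/bin/python3"
        }
    }]
}]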

I am certain that my shell script is installing the modules, because when I run the following on the EMR command line (via SSH) it works:

python3.6
import boto3

However when I run the following, it does not work:

python
import boto3

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named boto3

When I run the following command in Jupyter I get the output below:

import sys
import os

print(sys.version)

2.7.16 (default, Jul 19 2019, 22:59:28) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
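
One extra check (not in the original post) that can make the mismatch obvious is printing the interpreter path alongside the version; this is plain standard-library usage:

import sys

# The full path shows which interpreter the kernel/driver is actually bound to,
# e.g. /usr/bin/python2.7 versus /usr/bin/python3.6
print(sys.executable)
print(sys.version)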

Here is the shell script that runs when the EMR cluster starts:

#!/bin/bash
alias python=python3.6
export PYSPARK_DRIVER_PYTHON="python"
export SPARK_YARN_USER_ENV="python"
sudo python3 -m pip install boto3
sudo python3 -m pip install pandas
sudo python3 -m pip install pymysql
sudo python3 -m pip install xlrd
sudo python3 -m pip install pymssql

When I attempt to import boto3 in Jupyter, I get the following error message:

No module named boto3
Traceback (most recent call last):
ImportError: No module named boto3


Answer 1:


If you want to use Python3 with EMR notebooks, the recommended way is to use the pyspark kernel and configure Spark to use Python3 from within the notebook:

%%configure -f {"conf":{ "spark.pyspark.python": "python3" }}
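
This cell must run before the Spark session starts (in sparkmagic, the -f flag drops and recreates the session with the new configuration). A quick way to confirm it took effect, not part of the original answer, is a cell like:

import sys

# With spark.pyspark.python pointed at python3, this should now report a 3.x version
print(sys.version)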

Note that:

  • Any on-cluster configuration related to PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON is overridden by the EMR notebook configuration. The only way to configure Python3 is from within the notebook, as mentioned above.

  • The pyspark3 kernel is deprecated as of Livy 0.5+; the pyspark kernel is recommended for both Python2 and Python3, with spark.pyspark.python configured accordingly.

  • If you want to install additional Python dependencies that are not already present on the cluster, you can use notebook-scoped libraries. This works for both Python2 and Python3 (see the sketch after this list).
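
To illustrate that last point, a minimal sketch of notebook-scoped libraries in an EMR notebook could look like the following. The virtualenv configuration keys and the sc.install_pypi_package/sc.list_packages helpers come from the EMR notebook-scoped-libraries feature and require a sufficiently recent EMR release; none of this is in the original answer:

%%configure -f
{"conf": {
    "spark.pyspark.python": "python3",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
}}

Then, in a later cell:

# Install boto3 into the notebook-scoped environment and confirm it is importable
sc.install_pypi_package("boto3")
sc.list_packages()

import boto3
print(boto3.__version__)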



Source: https://stackoverflow.com/questions/57512577/how-to-set-jupyter-notebook-to-python3-instead-of-python2-7-in-aws-emr
