AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

醉梦人生 2020-11-30 13:09

I am running into this problem with the Apache Arrow Spark integration.

Using AWS EMR with Spark 2.4.3.

Tested this problem on both local spark single machine instanc

2 answers
  •  盖世英雄少女心
    2020-11-30 14:09

    On EMR, PySpark does not pick up python3 by default, so you have to set it explicitly. One way to do that is to pass a config.json file when you create the cluster. The option is in the Edit software settings section of the AWS EMR UI. A sample JSON file looks something like this:

    [
      {
        "Classification": "spark-env",
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "PYSPARK_PYTHON": "/usr/bin/python3"
            }
          }
        ]
      },
      {
        "Classification": "yarn-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "PYSPARK_PYTHON": "/usr/bin/python3"
            }
          }
        ]
      }
    ]
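
If you would rather generate this file than write it by hand, the same structure can be built in Python. This is only a minimal sketch: the function name `build_python3_config` and the helper `_export_block` are made up for illustration, and the interpreter path is assumed to be `/usr/bin/python3`, as in the sample above.

```python
import json


def _export_block(python_path):
    """Build one 'export' classification that sets PYSPARK_PYTHON."""
    return {
        "Classification": "export",
        "Properties": {"PYSPARK_PYTHON": python_path},
    }


def build_python3_config(python_path="/usr/bin/python3"):
    """Return the configuration list shown above as Python objects.

    python_path is an assumption; adjust it if python3 lives elsewhere
    on your cluster nodes.
    """
    return [
        {
            "Classification": "spark-env",
            "Configurations": [_export_block(python_path)],
        },
        {
            "Classification": "yarn-env",
            "Properties": {},
            "Configurations": [_export_block(python_path)],
        },
    ]


if __name__ == "__main__":
    # Print the JSON so it can be pasted into the software settings box
    # or redirected into config.json.
    print(json.dumps(build_python3_config(), indent=2))
```

Running the script prints the JSON, which you can paste into the settings box or save as config.json.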
    

    You also need the pyarrow module installed on all core nodes, not only on the master. For that, you can use a bootstrap script when creating the cluster in AWS. A sample bootstrap script can be as simple as this:

    #!/bin/bash
    sudo python3 -m pip install pyarrow==0.13.0
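
To check whether the install actually took effect, a small helper like the one below can report which interpreter is running and whether a module is importable. It is a sketch under assumptions: `module_status` is a hypothetical name, and it uses only the standard library, so it says nothing about which pyarrow version is installed.

```python
import importlib.util
import sys


def module_status(name="pyarrow"):
    """Report the running interpreter and whether `name` is importable.

    Only top-level module names are expected here (e.g. "pyarrow").
    """
    found = importlib.util.find_spec(name) is not None
    return {"python": sys.executable, "module": name, "available": found}


if __name__ == "__main__":
    # On the master node this shows what the driver-side Python can see.
    print(module_status())
```

Assuming you have a live SparkContext called `sc` (for example in the pyspark shell), something like `sc.parallelize(range(4), 4).map(lambda _: module_status()).collect()` should run the same check on the executors. If `available` is False there, or `python` is not the python3 path you configured, the worker nodes are the ones missing the module.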
    
