PySpark: ModuleNotFoundError: No module named 'app'

Submitted by 我们两清 on 2021-01-20 04:48:21

Question


I am saving a dataframe to a CSV file in PySpark using the statement below:

df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')

But I am getting the error below:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
    arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
    return self.loads(obj)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'

I am using PySpark version 2.3.0.

I am getting this error while trying to write to a file:

    import json, jsonschema
    from pyspark.sql import functions
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType, FloatType
    from datetime import datetime
    import os

    # Load and filter the February and April extracts
    feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
    apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)

    # Merge the two months and drop duplicate records by primary key
    df_all = feb.union(apr)
    df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])

    # Register the UDF and derive the EMI amount column
    create_emi_amount_udf = udf(create_emi_amount, FloatType())
    df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))

    # The write triggers the UDF on the executors; this is where the error surfaces
    df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')

Answer 1:


The error is clear: there is no module named 'app' on the executors. Your Python code runs on the driver, but your UDF runs in each executor's Python VM. When you call the UDF, Spark serializes create_emi_amount and sends it to the executors.
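
To illustrate: the function passed to udf is pickled on the driver and unpickled on every executor, so anything it references must be importable there. Here is a minimal sketch of a self-contained UDF; the EMI formula, rates, and tenure are invented placeholders, not the asker's actual logic:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    def create_emi_amount(sanction_amount, loan_type):
        # Import inside the function body so the dependency is resolved
        # on the executor, not captured from the driver's namespace.
        import math
        if sanction_amount is None:
            return None
        # Placeholder rates and tenure -- purely illustrative.
        annual_rate = 0.10 if loan_type == 'PERSONAL' else 0.08
        r = annual_rate / 12.0
        n = 60  # assumed 60-month tenure
        return float(sanction_amount) * r * math.pow(1 + r, n) / (math.pow(1 + r, n) - 1)

    create_emi_amount_udf = udf(create_emi_amount, FloatType())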

So, somewhere in your method create_emi_amount you use or import the app module. A solution to your problem is to use the same Python environment on both the driver and the executors. In spark-env.sh, set the same Python virtualenv in PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=....
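
For example, conf/spark-env.sh on each node could point both variables at the same virtualenv (the path below is a placeholder, not taken from the question):

    # spark-env.sh: use one Python environment for driver and executors
    export PYSPARK_PYTHON=/opt/venvs/myproject/bin/python
    export PYSPARK_DRIVER_PYTHON=/opt/venvs/myproject/bin/python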



Source: https://stackoverflow.com/questions/56901591/pyspark-modulenotfounderror-no-module-named-app
