Question
There are many posts about this issue, but none have answered my question.
I'm running into OutOfMemoryErrors in PySpark while attempting to join many different dataframes together.
My local machine has 16GB of memory, and I've set my Spark configuration as follows:
class SparkRawConsumer:

    def __init__(self, filename, reference_date, FILM_DATA):
        self.sparkContext = SparkContext(master='local[*]', appName='my_app')
        SparkContext.setSystemProperty('spark.executor.memory', '3g')
        SparkContext.setSystemProperty('spark.driver.memory', '15g')
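As a side note, not from the original post: spark.driver.memory is only honored if it is configured before the driver JVM starts, so calling setSystemProperty after the SparkContext already exists has no effect. A minimal sketch of passing both settings through a SparkConf at context-creation time instead, with the master and app name assumed to match the code above:

from pyspark import SparkConf, SparkContext

# Build the configuration before any context exists; the driver's heap size
# cannot be changed once the driver JVM is already running.
conf = (SparkConf()
        .setMaster('local[*]')
        .setAppName('my_app')
        .set('spark.driver.memory', '10g')     # example value, not from the post
        .set('spark.executor.memory', '3g'))

sc = SparkContext.getOrCreate(conf=conf)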
There are clearly many SO posts about OOM errors in Spark, but most of them simply say to increase the memory settings.
I am essentially performing joins of 50-60 smaller dataframes, each of which has two columns, uid and data_in_the_form_of_lists (usually a list of Python strings). The master dataframe that I am joining onto has about 10 columns and also contains a uid column, which is the join key.
I'm only attempting to join 1,500 rows of data. However, I encounter frequent OutOfMemory errors, even though all of this data should clearly fit into memory. I confirm this by looking at the Storage tab in my Spark UI:

(screenshot: Spark UI Storage tab)
In code, my joins look like this:
# lots of computation to read in my dataframe and produce metric1, metric2, metric3, ..., metric50
metrics_df = metrics_df.join(
    self.sqlContext.createDataFrame(metric1, schema=["uid", "metric1"]),
    on="uid")
metrics_df.count()
metrics_df.repartition("gid_value")

metrics_df = metrics_df.join(
    self.sqlContext.createDataFrame(metric2, schema=["uid", "metric2"]),
    on="gid_value")
metrics_df.repartition("gid_value")

metrics_df = metrics_df.join(
    self.sqlContext.createDataFrame(metric3, schema=["uid", "metric3"]),
    on="uid")
metrics_df.count()
metrics_df.repartition("gid_value")
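Purely as an illustration (this is not the asker's code; the tiny dataframes and column names here are made up), the same pattern of folding many small two-column dataframes onto a master dataframe can be written with functools.reduce and an explicit broadcast hint, which tells Spark that each small dataframe fits in memory on every executor:

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").appName("join_sketch").getOrCreate()

# Hypothetical stand-ins for the master dataframe and two of the ~50 metric dataframes.
master_df = spark.createDataFrame([(1, "a"), (2, "b")], schema=["uid", "some_col"])
metric_dfs = [
    spark.createDataFrame([(1, ["x"]), (2, ["y"])], schema=["uid", "metric1"]),
    spark.createDataFrame([(1, ["p"]), (2, ["q"])], schema=["uid", "metric2"]),
]

# Fold the joins; broadcast() hints that each small dataframe should be shipped
# whole to every executor instead of being shuffled.
metrics_df = reduce(
    lambda acc, small: acc.join(broadcast(small), on="uid"),
    metric_dfs,
    master_df,
)
metrics_df.show()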
Here metric1, metric2, and metric3 are RDDs that I convert into dataframes prior to the join (keep in mind there are actually 50 of these smaller metric dataframes that I am joining).
I call metrics_df.count() to force evaluation, since it seemed to help prevent the memory errors (I would get many more driver errors when attempting the final collect otherwise).
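A side note that goes beyond the original post: count() only triggers computation, it does not keep the result, so later actions still recompute the entire join lineage. The usual pattern for materializing an intermediate dataframe and truncating a long join lineage is persist() plus count(), or checkpoint(). A minimal sketch, with the checkpoint directory chosen arbitrarily:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("persist_sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # example scratch location

df = spark.range(1000).withColumnRenamed("id", "uid")

# persist() + count() materializes the dataframe so later actions reuse the cached data.
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()

# checkpoint() additionally writes the data out and cuts the lineage, which keeps
# long chains of joins from producing ever-larger query plans.
df = df.checkpoint()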
The errors are non-deterministic. They don't occur at any particular spot in my joins consistently; sometimes they appear during my final metrics_df.collect() call, and sometimes during the smaller joins.
I really suspect there are some issues with task serialization/deserialization. For instance, when I look at the Event Timeline for a typical stage, I see that the bulk of it is taken up by task deserialization:

(screenshot: Spark UI event timeline for a typical stage)
I also notice a huge value for garbage collection time:

(screenshot: garbage collection time in the Spark UI)
Is garbage collection the cause of the memory errors, or is it task serialization?
Edit to answer comment questions
I've been running the Spark job as part of a larger PyCharm project (which is why the Spark context was wrapped in a class). I refactored the code to run it as a script, using the following spark-submit command:
spark-submit spark_consumer.py \
--driver-memory=10G \
--executor-memory=5G \
--conf spark.executor.extraJavaOptions='-XX:+UseParallelGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'
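One detail worth flagging, although it is not discussed in the post: spark-submit expects its own options before the application file, and everything after the script name is passed to the script as application arguments. A conventionally ordered version of the same command would therefore look like this:

spark-submit \
  --driver-memory 10G \
  --executor-memory 5G \
  --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  spark_consumer.py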
Answer 1:
I faced a similar issue, and it worked for me with the following:
Spark Submit:
spark-submit --driver-memory 3g \
    --executor-memory 14g \
    *.py
Code:
sc = SparkContext.getOrCreate()
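A slightly fuller sketch of how this fits together on the script side (the config lookup here is just for illustration, not part of the answer): with the memory set on the spark-submit command line, the code only needs to attach to the already-configured context:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Memory settings come from spark-submit, so the script just attaches to
# (or creates) the context rather than configuring it.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# The value passed via --driver-memory shows up in the context's configuration.
print(sc.getConf().get("spark.driver.memory", "not set"))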
Source: https://stackoverflow.com/questions/51310952/pyspark-outofmemoryerrors-when-performing-many-dataframe-joins