Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller

Submitted by 巧了我就是萌 on 2019-12-03 13:01:55

A simple reproduction scenario:

import time
from pyspark import SparkContext

sc = SparkContext()

def push_and_pop(rdd):
    # two transformations: moves the head element to the tail
    first = rdd.first()
    return rdd.filter(
        lambda obj: obj != first
    ).union(
        sc.parallelize([first])
    )

def serialize_and_deserialize(rdd):
    # perform a collect() action to evaluate the rdd and create a new instance
    return sc.parallelize(rdd.collect())

def do_test(serialize=False):
    rdd = sc.parallelize(range(1000))
    for i in range(25):
        t0 = time.time()
        rdd = push_and_pop(rdd)
        if serialize:
            rdd = serialize_and_deserialize(rdd)
        print("%.3f" % (time.time() - t0))

do_test()

Shows major slowdown during the 25 iterations:

0.597 0.117 0.186 0.234 0.288 0.309 0.386 0.439 0.507 0.529 0.553 0.586 0.710 0.728 0.779 0.896 0.866 0.881 0.956 1.049 1.069 1.061 1.149 1.189 1.201

(first iteration is relatively slow because of initialization effects, second iteration is quick, every subsequent iteration is slower).

The cause seems to be the growing chain of lazy transformations. We can test the hypothesis by rolling up the RDD using an action.

do_test(True)

0.897 0.256 0.233 0.229 0.220 0.238 0.234 0.252 0.240 0.267 0.260 0.250 0.244 0.266 0.295 0.464 0.292 0.348 0.320 0.258 0.250 0.201 0.197 0.243 0.230

The collect()/parallelize() round trip adds about 0.1 seconds to each iteration, but it completely eliminates the incremental slowdown.

I resolved this issue by saving the DataFrame to HDFS at the end of every iteration and reading it back from HDFS at the beginning of the next one.

Since I started doing that, the program runs like a breeze and shows no signs of slowing down, exhausting memory, or overloading the driver.

I still don't understand why this happens, so I'm leaving the question open.

Your code has the correct logic. The problem is that you never call item_links.unpersist(), so it first slows down (swapping to local disk) and then runs out of memory.

Memory usage in Ganglia may be misleading: it won't change, since memory is allocated to the executors at the start of the script regardless of whether they use it later. Check the Spark UI instead for the storage and executor status.

Try printing dataFrame.explain() to see the logical plan. With every iteration, the transformations on this DataFrame keep accumulating in the logical plan, so the evaluation time keeps growing.

You can use the solution below as a workaround:

dataFrame.rdd.localCheckpoint()

This writes the RDD backing this DataFrame to memory, removes the lineage, and then recreates the RDD from the data written to memory.

The good thing about this is that you don't need to write your RDD to HDFS or disk. However, it also comes with some caveats that may or may not affect you; see the documentation of the localCheckpoint() method for details.
