Question
I am using PySpark to apply a trained deep learning model to images and am concerned with how memory usage will scale with my current approach. Because the trained model takes a while to load, I process large batches of images on each worker with code similar to the following:
def run_eval(file_generator):
    trained_model = load_model()
    results = []
    for file in file_generator:
        # "file" is a tuple: [0] is its filename, [1] is the byte data
        results.append(trained_model.eval(file[1]))
    return results
my_rdd = sc.binaryFiles('adl://my_file_path/*.png').repartition(num_workers)
results = my_rdd.mapPartitions(run_eval)
results.collect()
As noted above, the files are stored on an associated HDFS-compatible file system (specifically, an Azure Data Lake Store) which can be accessed through the SparkContext.
My main questions are:
- When is the image data being loaded into memory?
- Is each image's data loaded when the generator increments ("just in time")?
- Is all image data for the whole partition loaded before the worker starts?
- Is the head node responsible for loading the data from this associated file system (potentially creating a bottleneck), or do workers load their own data from it?
I would also appreciate advice on where to find these topics covered in depth.
Answer 1:
When is the image data being loaded into memory?
- Is each image's data loaded when the generator increments ("just in time")?
Actually, given your code, it has to be loaded more than once: first it is accessed by the JVM and converted to Python types, and then, because of the repartition, a shuffle occurs and the data is loaded once again. Each of these steps is lazy, so loading itself is not the issue.
So the first question you have to ask yourself is whether you really have to shuffle. binaryFiles has a minPartitions argument which can be used to control the number of partitions.
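For example, a minimal sketch (reusing sc, num_workers, and run_eval from the question; the exact number of partitions you end up with still depends on the input file sizes and Hadoop split settings):

# Ask for the desired parallelism up front instead of repartitioning afterwards,
# which avoids shuffling the raw image bytes between executors.
my_rdd = sc.binaryFiles('adl://my_file_path/*.png', minPartitions=num_workers)
results = my_rdd.mapPartitions(run_eval)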
Another problem is the non-lazy results list. It would make much more sense to use a generator function:
def run_eval(file_generator):
    trained_model = load_model()
    for file in file_generator:
        yield trained_model.eval(file[1])
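As a sketch of how this slots into the rest of the pipeline (same names as in the question; the only change is that predictions are yielded one at a time instead of being buffered in a per-partition list):

# mapPartitions accepts any function mapping an iterator to an iterator,
# so the generator function is a drop-in replacement for the original run_eval.
# Note that collect() still brings every prediction to the driver; writing them
# out (e.g. with saveAsTextFile) keeps them distributed.
results = my_rdd.mapPartitions(run_eval)
results.collect()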
Is the head node responsible for loading the data from this associated file system (potentially creating a bottleneck), or do workers load their own data from it?
There is no central processing involved. Each executor process (Python) / thread (JVM) will load its own part of the dataset.
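If you want to see this for yourself, one possible sketch (not part of the original answer; the list_files helper is purely illustrative) is to have each partition report which files it received. Only the small (partition index, filename) pairs travel back to the driver; the image bytes are read on the executors and stay there:

# Each task enumerates the files in its own partition; the byte payloads are
# read on the executors, and only lightweight (index, path) pairs are collected.
def list_files(index, iterator):
    for path, _content in iterator:
        yield (index, path)

print(my_rdd.mapPartitionsWithIndex(list_files).collect())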
Source: https://stackoverflow.com/questions/41767986/when-do-binaryfiles-load-into-memory-when-mappartitions-is-used