When do binaryFiles load into memory when mapPartitions is used?

Posted by 寵の児 on 2019-12-13 15:59:41

Question


I am using PySpark to apply a trained deep learning model to images and am concerned with how memory usage will scale with my current approach. Because the trained model takes a while to load, I process large batches of images on each worker with code similar to the following:

def run_eval(file_generator):
    trained_model = load_model()  # expensive: load the model once per partition
    results = []
    for file in file_generator:
        # "file" is a tuple: file[0] is the filename, file[1] is the byte content
        results.append(trained_model.eval(file[1]))
    return results

my_rdd = sc.binaryFiles('adl://my_file_path/*.png').repartition(num_workers)
results = my_rdd.mapPartitions(run_eval)
results.collect()

As noted above, the files are stored on an associated HDFS-compatible file system (specifically, an Azure Data Lake Store) which can be accessed through the SparkContext.

My main questions are:

  • When is the image data being loaded into memory?
    • Is each image's data loaded when the generator increments ("just in time")?
    • Is all image data for the whole partition loaded before the worker starts?
  • Is the head node responsible for loading the data from this associated file system (potentially creating a bottleneck), or do workers load their own data from it?

I would also appreciate advice on where these topics are covered in depth.


Answer 1:


When is the image data being loaded into memory?

  • Is each image's data loaded when the generator increments ("just in time")?

Actually, given your code, the data has to be loaded more than once: first it is read by the JVM and converted to Python types, and then the shuffle forces it to be loaded once again. Each of these steps is lazy, so the loading itself is not an issue.
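A minimal sketch of that laziness, reusing the names from the question: no bytes are read when the RDD is defined, only when an action runs.

    my_rdd = sc.binaryFiles('adl://my_file_path/*.png')  # nothing is read yet
    evaluated = my_rdd.mapPartitions(run_eval)           # still nothing is read
    evaluated.count()                                    # action: files are now read, partition by partition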

So the first question you have to ask yourself is whether you really have to shuffle at all. binaryFiles has a minPartitions argument which can be used to control the number of partitions at read time, as sketched below.
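A minimal sketch, reusing num_workers from the question, that sets the partition count when the files are read and drops the repartition (and with it the shuffle):

    my_rdd = sc.binaryFiles('adl://my_file_path/*.png', minPartitions=num_workers)
    results = my_rdd.mapPartitions(run_eval)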

Another problem is the non-lazy results list. It would make much more sense to use a generator function:

def run_eval(file_generator):
    trained_model = load_model()
    for file in file_generator:
        yield trained_model.eval(file[1])
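
With the generator, each image can be released as the iterator advances instead of every result for a partition accumulating in a list. Note that results.collect() on the driver still materializes all results at once; if the outputs are large, iterating with toLocalIterator is a less memory-hungry alternative (a sketch; handle is a hypothetical downstream consumer):

    # Pull results back one partition at a time instead of all at once.
    for result in my_rdd.mapPartitions(run_eval).toLocalIterator():
        handle(result)  # hypothetical consumer of each result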

Is the head node responsible for loading the data from this associated file system (potentially creating a bottleneck), or do workers load their own data from it?

There is no central processing involved. Each executor process (Python) / executor thread (JVM) loads its own part of the dataset directly from the file system.
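
If you want to verify this empirically, a small diagnostic sketch (tag_with_host is just an illustrative helper) tags each file with the hostname of the executor that read it:

    import socket

    def tag_with_host(files):
        host = socket.gethostname()  # hostname of the executor running this partition
        for filename, _content in files:
            yield (host, filename)

    # Each record is read and tagged on the executor that owns the partition,
    # so the collected pairs show which worker loaded which file.
    sc.binaryFiles('adl://my_file_path/*.png').mapPartitions(tag_with_host).collect()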



Source: https://stackoverflow.com/questions/41767986/when-do-binaryfiles-load-into-memory-when-mappartitions-is-used
