Why does memory consumption increase dramatically in `Pool.map()` multiprocessing?

Submitted by 雨燕双飞 on 2019-12-07 11:41:53

Question


I am running multiprocessing on a pandas dataframe by splitting it into several smaller dataframes, which are stored in a list. Using Pool.map() I pass each dataframe to a defined function. My input file is about 300 MB, so each small dataframe is roughly 75 MB. Yet while the multiprocessing runs, total memory consumption grows by about 7 GB and each worker process consumes roughly 2 GB of memory. Why is this happening?

import resource
from multiprocessing import Pool

import pandas as pd


def main():

    my_df = pd.read_table("my_file.txt", sep="\t")
    my_df = my_df.groupby('someCol')

    my_df_list = []
    for colID, colData in my_df:
        my_df_list.append(colData)

    # now, multiprocess each small dataframe individually    
    p = Pool(3)
    result = p.map(process_df, my_df_list)

    p.close()
    p.join()

    print('Global maximum memory usage: %.2f (mb)' % current_mem_usage())

    result_merged = pd.concat(result)

    # write merged data to file


def process_df(my_df):
    my_new_df = my_df.copy()  # placeholder: do something with "my_df"

    print('\tWorker maximum memory usage: %.2f (mb)' % (current_mem_usage()))

    del my_df
    return my_new_df


# helper to monitor memory usage
def current_mem_usage():
    # ru_maxrss is reported in kilobytes on Linux, so divide to get MB
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.


if __name__ == '__main__':
    main()
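
A note on current_mem_usage(): resource.getrusage() returns ru_maxrss, which is the peak resident set size the process has reached so far (on Linux it is given in kilobytes, hence the division by 1024), not the current usage, so the printed numbers can only grow and never shrink, even after objects are deleted. A minimal stand-alone sketch of that behaviour (the 200 MB allocation is just an arbitrary illustration):

import gc
import resource


def peak_mb():
    # peak resident set size so far, in MB (ru_maxrss is in KB on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.


print('at start:         %.2f (mb)' % peak_mb())

blob = b'x' * (200 * 1024 * 1024)      # touch roughly 200 MB of memory
print('after allocation: %.2f (mb)' % peak_mb())

del blob
gc.collect()
print('after deletion:   %.2f (mb)' % peak_mb())   # the peak does not go back down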

The results are correct, but memory consumption is very high for each 75 MB chunk. Why is that? Is it a leak? What are the possible remedies?

Output of the memory usage:

Worker maximum memory usage: 2182.84 (mb)
Worker maximum memory usage: 2182.84 (mb)
Worker maximum memory usage: 2837.69 (mb)
Worker maximum memory usage: 2849.84 (mb)
Global maximum memory usage: 3106.00 (mb)
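
For extra context, Pool.map() pickles each element of the list before dispatching it to a worker process, and each returned dataframe is pickled again on the way back. One way to gauge that serialization overhead is a quick stand-alone check like the sketch below (toy data with an assumed someCol grouping column; the numbers it prints are illustrative only, not measurements from my file):

import pickle

import numpy as np
import pandas as pd

# Build a toy frame, split it with groupby as in main(), and compare the
# in-memory size of each chunk with its pickled size before dispatch.
toy = pd.DataFrame({
    'someCol': np.random.randint(0, 3, size=1_000_000),
    'value': np.random.rand(1_000_000),
})

chunks = [group for _, group in toy.groupby('someCol')]

for i, chunk in enumerate(chunks):
    in_mem_mb = chunk.memory_usage(deep=True).sum() / 1024. ** 2
    pickled_mb = len(pickle.dumps(chunk)) / 1024. ** 2
    print('chunk %d: %.1f (mb) in memory, %.1f (mb) pickled' % (i, in_mem_mb, pickled_mb))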

Source: https://stackoverflow.com/questions/49475489/why-does-memory-consumption-increase-dramatically-in-pool-map-multiprocessin
