How to solve memory problems while multiprocessing using Pool.map()?

猫巷女王i 2020-12-12 15:41

I have written the program (below) to:

  • read a huge text file as a pandas dataframe
  • then group it by a specific column value
4 answers
  •  孤街浪徒
    2020-12-12 16:24

    GENERAL ANSWER ABOUT MEMORY WITH MULTIPROCESSING

    You asked: "What is causing so much memory to be allocated?" The answer has two parts.

    First, as you already noticed, each multiprocessing worker gets its own copy of the data (quoted below), so you should chunk large arguments, or, for large files, read them in a little at a time if possible; a small sketch of the chunking idea follows the quoted passage.

    By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.

    This can be problematic for large arguments as they will be reallocated n_jobs times by the workers.
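
    As a toy sketch of that point (nothing here comes from your code; the array, the summarize function and the number of pieces are all made up), splitting one large argument into many small pieces keeps the per-worker copies small, because the pool pickles and ships the individual pieces rather than the whole object:

        from multiprocessing import Pool

        import numpy as np

        def summarize(piece):
            # each call receives just one piece; the pool pickles the pieces
            # (in small batches) rather than the whole array
            return piece.sum()

        if __name__ == '__main__':
            big = np.arange(100_000_000)           # ~800 MB of int64 in the parent
            pieces = np.array_split(big, 100)      # chunk the large argument into ~8 MB slices
            with Pool(3) as p:
                totals = p.map(summarize, pieces)  # no worker ever holds `big` in full
            print(sum(totals))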

    Second, if you're trying to reclaim memory, you need to understand that Python works differently from many other languages: you are relying on del to release the memory, and it often doesn't. I don't know if it's the best approach, but in my own code I've overcome this by reassigning the variable to None or an empty object.

    FOR YOUR SPECIFIC EXAMPLE - MINIMAL CODE EDITING

    As long as you can fit your large data in memory twice, I think you can do what you are trying to do by just changing a single line. I've written very similar code and it worked for me when I reassigned the variable (instead of calling del or any kind of garbage collection). If this doesn't work, you may need to follow the suggestions above and use disk I/O; a rough sketch of that follows the code block below:

        #### earlier code all the same
        # clear memory by reassignment (not del or gc)
        gen_matrix_df = {}
    
        '''Now, pipe each dataframe from the list using Pool.map() '''
        p = Pool(3)  # number of worker processes (Pool() actually defaults to os.cpu_count(), not 1)
        result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
    
        #del gen_matrix_df_list  # I suspect you don't even need this; the memory will be freed when the pool is closed
    
        p.close()
        p.join()
        #### later code all the same
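
    If reassignment alone doesn't free enough memory, here is a rough sketch of the disk I/O route mentioned above: spill each grouped dataframe to disk and pass only file paths to the pool, so nothing large has to be pickled to the workers. The wrapper name matrix_to_vcf_from_path and the use of pickle files are my own assumptions, not part of your code:

        import os
        import tempfile
        from multiprocessing import Pool

        import pandas as pd

        def matrix_to_vcf_from_path(path):
            # hypothetical wrapper: each worker loads its own chromosome file,
            # so only short path strings travel through the pool
            df = pd.read_pickle(path)
            return matrix_to_vcf(df)   # your original worker function

        tmpdir = tempfile.mkdtemp()
        paths = []
        for chr_, df in gen_matrix_df_list.items():
            path = os.path.join(tmpdir, '{}.pkl'.format(chr_))
            df.to_pickle(path)          # spill each chromosome to disk
            paths.append(path)
        gen_matrix_df_list = {}         # the parent no longer needs the dataframes in RAM

        p = Pool(3)
        result = p.map(matrix_to_vcf_from_path, paths)
        p.close()
        p.join()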
    

    FOR YOUR SPECIFIC EXAMPLE - OPTIMAL MEMORY USAGE

    As long as you can fit your large data in memory once, and you have some idea of how big your file is, you can use the partial file reading of pandas read_csv: read in only nrows at a time if you really want to micro-manage how much data is being read in, or a fixed amount of data at a time using chunksize, which returns an iterator. By that I mean, the nrows parameter is just a single read: you might use it to get a peek at a file, or if for some reason you wanted each part to have exactly the same number of rows (for example, if any of your data is strings of variable length, each row will not take up the same amount of memory).

    But for the purposes of prepping a file for multiprocessing, I think it will be far easier to use chunks, because chunks relate directly to memory, which is your concern. It will be easier to use trial & error to fit specific-sized chunks into memory than a fixed number of rows, since the memory used by a given number of rows depends on how much data is in them (a quick way to measure this is sketched after the code below). The only other difficult part is that, for some application-specific reason, you're grouping some rows, which makes it a little more complicated. Using your code as an example:

        '''Load the genome matrix file into pandas as a dataframe.
        This makes multiprocessing easier.'''
        # (assumes collections, pandas as pd and multiprocessing.Pool are imported,
        #  and genome_matrix_file, header and matrix_to_vcf are defined in your original code)
    
        # store the split dataframes as a dict of key -> list of pandas dataframes;
        # this dict of dataframes will be used while multiprocessing
        # not sure why you need the ordered dict here; it might add memory overhead
        #gen_matrix_df_list = collections.OrderedDict()
        # a defaultdict won't throw an exception when we append to a key for the first time;
        # if you don't want a defaultdict for some reason, you have to initialize each entry you care about
        gen_matrix_df_list = collections.defaultdict(list)
        chunksize = 10 ** 6
    
        for chunk in pd.read_csv(genome_matrix_file, sep='\t', names=header, chunksize=chunksize):
            # now, group the dataframe by chromosome/contig - so it can be multiprocessed
            gen_matrix_df = chunk.groupby('CHROM')
            for chr_, data in gen_matrix_df:
                gen_matrix_df_list[chr_].append(data)
    
        '''Having sorted the chunks on read into lists of dataframes, now create a single dataframe for each chr_'''
        # each dict value is a list of small dataframe pieces, so concatenate them;
        # by reassigning into the same dict, the memory footprint does not keep growing
        for chr_ in gen_matrix_df_list.keys():
            gen_matrix_df_list[chr_]=pd.concat(gen_matrix_df_list[chr_])
    
        '''Now, pipe each dataframe from the list using Pool.map() '''
        p = Pool(3)  # number of worker processes (Pool() actually defaults to os.cpu_count(), not 1)
        result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
        p.close()
        p.join()
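
    A quick way to do the trial & error mentioned above: peek at the file with nrows to check the columns, then read one chunk of your candidate chunksize and measure its real memory footprint, and adjust chunksize until (memory per chunk x number of pool workers) fits comfortably in RAM. This sketch only reuses genome_matrix_file and header from your code; the 10**6 is just the chunksize from the example above:

        import pandas as pd

        # a single tiny read, just to inspect the columns and dtypes
        peek = pd.read_csv(genome_matrix_file, sep='\t', names=header, nrows=5)
        print(peek.dtypes)

        # read ONE chunk of the size you plan to use and measure what it actually costs
        reader = pd.read_csv(genome_matrix_file, sep='\t', names=header, chunksize=10**6)
        chunk = next(reader)
        print(chunk.memory_usage(deep=True).sum() / 1e9, 'GB for one chunk')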
    
