Is there a good way to avoid memory deep copy or to reduce time spent in multiprocessing?

时光取名叫无心 2021-02-06 10:47

I am building a memory-based, real-time calculation module for "big data" using the Pandas module in a Python environment.

So response time is the key quality metric for this module, and ...

2 Answers
  •  轮回少年
    2021-02-06 11:28

    Inspired by this question and @unutbu's answer, I wrote a parallel version of map on github. The function is suitable for embarrassingly parallel processing of a large read-only data structure on a single machine with multiple cores. The basic idea is similar to what @unutbu suggested: use a temporary global variable to hold the big data structure (e.g., a data frame) and pass its "name" rather than the variable itself to the workers. All of this is encapsulated in a map function so that it is almost a drop-in replacement for the standard map function, with the help of the pathos package. Example usage is as follows:

    import numpy as np
    import pandas as pd

    # Suppose we process a big dataframe with millions of rows.
    size = 10**7
    df = pd.DataFrame(np.random.randn(size, 4),
                      columns=['column_01', 'column_02',
                               'column_03', 'column_04'])
    # Divide df into sections of 10000 rows; each section will be
    # processed by one worker at a time.
    section_size = 10000
    sections = [range(start, start + section_size)
                for start in range(0, size, section_size)]
    
    # The worker function that processes one section of the df.
    # The key assumption is that a child process does NOT modify
    # the dataframe, but does some analysis or aggregation and
    # returns a result.
    def func(section, df):
        # some_processing is a placeholder for the actual analysis.
        return some_processing(df.iloc[section])
    
    num_cores = 4
    # Note: `map` here is the parallel map from the github repository
    # mentioned above (built on pathos), not the built-in map.
    # `sections` (the local args) specifies which parts of the big object
    # each task processes; `global_arg` holds the big object itself so it
    # is not copied for every task. `results` is a list with one element
    # per section, in order.
    results = map(func, sections, global_arg=df,
                  chunksize=10,
                  processes=num_cores)
    
    # reduce results (assume it is a list of data frames)
    result = pd.concat(results)
    

    In some of my text mining tasks, a naive parallel implementation that passes df directly to the worker function is even slower than the single-threaded version, because of the expensive copy of the large data frame. The implementation above, however, gives a more than 3x speedup on those tasks with 4 cores, which is close to what real lightweight multithreading would achieve.
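
    For reference, below is a minimal sketch of the underlying trick using only the standard library; it is not the actual github implementation. The dataframe is stored in a module-level global before the worker pool is created, so on a POSIX system the forked workers inherit it copy-on-write and only the small section objects travel between processes. The names (`_big_df`, `_worker`) and the fork-based approach are illustrative assumptions.

    import multiprocessing as mp

    import numpy as np
    import pandas as pd

    # Hypothetical module-level global: set in the parent before forking,
    # then read (never modified) by the workers, so copy-on-write keeps
    # the memory shared instead of deep-copying it per process.
    _big_df = None

    def _worker(section):
        # Only `section` is pickled and sent to the worker; the dataframe
        # is accessed through the inherited global.
        return _big_df.iloc[list(section)].sum()

    def main():
        global _big_df
        _big_df = pd.DataFrame(np.random.randn(10**6, 4),
                               columns=['column_01', 'column_02',
                                        'column_03', 'column_04'])
        section_size = 10000
        sections = [range(start, start + section_size)
                    for start in range(0, len(_big_df), section_size)]

        # 'fork' is what makes the sharing work; it is available on Linux
        # and other POSIX systems, but not on Windows.
        ctx = mp.get_context('fork')
        with ctx.Pool(processes=4) as pool:
            results = pool.map(_worker, sections, chunksize=10)

        # Reduce: results is a list of per-section column sums (Series).
        print(pd.concat(results, axis=1).sum(axis=1))

    if __name__ == '__main__':
        main()

    The pathos-based map linked above wraps the same idea behind a map-like interface, so the big object is named once and never shipped to the workers task by task.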
