Python: using multiprocessing on a pandas dataframe

逝去的感伤 2020-12-12 14:15

I want to use multiprocessing on a large dataset to find the distance between two GPS points. I constructed a test set, but I have been unable to get mult

2 answers
  •  Happy的楠姐
    2020-12-12 14:29

    What's wrong

    This line from your code:

    pool.map(calc_dist, ['lat','lon'])
    

    submits two tasks to the pool: one runs calc_dist('lat') and the other runs calc_dist('lon'). Compare the first example in the multiprocessing documentation: pool.map(f, [1,2,3]) calls f three times, once for each element of the list, as f(1), f(2), and f(3). If I'm not mistaken, your function calc_dist is meant to be called as calc_dist('lat', 'lon'), so mapping it over ['lat','lon'] neither calls it correctly nor splits the work in a way that can run in parallel.
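    To make the calling convention concrete, here is a minimal sketch with toy functions (square and add are made-up names, not part of the question's code). It shows that pool.map passes one list element per call, and that pool.starmap is the variant that unpacks tuples into several arguments.

```python
import multiprocessing as mp


def square(x):
    """Toy worker: receives exactly one element of the iterable per call."""
    return x * x


def add(a, b):
    """Toy two-argument worker, used with starmap below."""
    return a + b


def run_demo():
    with mp.Pool(processes=2) as pool:
        # Calls square(1), square(2), square(3); results keep input order.
        squares = pool.map(square, [1, 2, 3])
        # starmap unpacks each tuple: add(1, 2) and add(3, 4).
        sums = pool.starmap(add, [(1, 2), (3, 4)])
    return squares, sums


if __name__ == "__main__":
    print(run_demo())  # ([1, 4, 9], [3, 7])
```

    So pool.map(calc_dist, ['lat','lon']) behaves like the first call here: each string is handed to calc_dist on its own.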

    Solution

    I believe you want to split the work between processes, probably sending each tuple (grp, lst) to a separate process. The following code does exactly that.

    First, let's prepare for splitting:

    grp_lst_args = list(df.groupby('co_nm').groups.items())
    
    print(grp_lst_args)
    [('aa', [0, 1, 2]), ('cc', [7, 8, 9]), ('bb', [3, 4, 5, 6])]
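    The question's df is not shown here, so the following sketch uses a hypothetical stand-in frame with only the co_nm and ser_no columns. It builds the same kind of argument list; depending on the pandas version, the values in .groups may be Index objects rather than plain lists, so the sketch converts them explicitly.

```python
import pandas as pd

# Hypothetical stand-in for the question's df (values are illustrative only).
df = pd.DataFrame({
    "co_nm": ["aa"] * 3 + ["bb"] * 4 + ["cc"] * 3,
    "ser_no": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
})

# .groups maps each group name to the row labels that belong to it.
groups = df.groupby("co_nm").groups

# Convert the labels to plain lists so each tuple is cheap to send to a worker.
grp_lst_args = [(name, labels.tolist()) for name, labels in groups.items()]

print(grp_lst_args)
# [('aa', [0, 1, 2]), ('bb', [3, 4, 5, 6]), ('cc', [7, 8, 9])]
```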
    

    We'll send each of these tuples (here, there are three of them) as an argument to a function in a separate process. We need to rewrite the function; let's call it calc_dist2. For convenience, its argument is a single tuple, as in calc_dist2(('aa',[0,1,2])).

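    from itertools import combinations

    import pandas as pd
    # Assumption: vincenty comes from geopy (available in geopy < 2.0);
    # the question's own imports are not shown.
    from geopy.distance import vincenty

    def calc_dist2(arg):
        # arg is one (group name, list of row labels) tuple
        grp, lst = arg
        return pd.DataFrame(
                   [ [grp,
                      df.loc[c[0]].ser_no,
                      df.loc[c[1]].ser_no,
                      vincenty(df.loc[c[0], ['lat','lon']], 
                               df.loc[c[1], ['lat','lon']])
                     ]
                     for c in combinations(lst, 2)  # every pair of rows in the group
                   ],
                   columns=['co_nm','machineA','machineB','distance'])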
    

    And now comes the multiprocessing:

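    import multiprocessing as mp

    # Leave one core free, but never ask for fewer than one worker.
    pool = mp.Pool(processes=max(1, mp.cpu_count() - 1))
    results = pool.map(calc_dist2, grp_lst_args)  # one task per (grp, lst) tuple
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for the workers to exit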
    
    results_df = pd.concat(results)
    

    results is a list holding the return value (here, a data frame) of calc_dist2((grp,lst)) for each (grp,lst) in grp_lst_args. These data frames are then concatenated into one data frame.

    print(results_df)
      co_nm  machineA  machineB          distance
    0    aa         1         2  156.876149391 km
    1    aa         1         3  313.705445447 km
    2    aa         2         3  156.829329105 km
    0    cc         8         9  156.060165391 km
    1    cc         8         0  311.910998169 km
    2    cc         9         0  155.851498134 km
    0    bb         4         5  156.665641837 km
    1    bb         4         6  313.214333025 km
    2    bb         4         7  469.622535339 km
    3    bb         5         6  156.548897414 km
    4    bb         5         7  312.957597466 km
    5    bb         6         7   156.40899677 km
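    The row index in this output restarts at 0 for every group because pd.concat keeps each piece's own index by default. If a continuous index is preferred, ignore_index=True renumbers the rows. A small sketch with made-up placeholder values (not the distances above):

```python
import pandas as pd

# Two tiny per-group results with placeholder numbers, for illustration only.
parts = [
    pd.DataFrame({"co_nm": ["aa", "aa"], "distance_km": [1.0, 2.0]}),
    pd.DataFrame({"co_nm": ["bb"], "distance_km": [3.0]}),
]

stacked = pd.concat(parts)                          # keeps 0, 1, 0
renumbered = pd.concat(parts, ignore_index=True)    # renumbers to 0, 1, 2

print(list(stacked.index))      # [0, 1, 0]
print(list(renumbered.index))   # [0, 1, 2]
```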
    

    By the way, in Python 3 the pool can be used as a context manager in a with statement:

    with mp.Pool() as pool:
        results = pool.map(calc_dist2, grp_lst_args)
    

    Update

    I tested this code only on Linux. On Linux, child processes are forked, so they can read the read-only data frame df without it being passed to them or copied up front, but I'm not sure exactly how this works on Windows. You may want to split df into chunks (grouped by co_nm) and send these chunks as arguments to some other version of calc_dist.
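    A minimal sketch of that chunked variant follows. On Windows, multiprocessing starts workers with the spawn method, which re-imports the main module instead of inheriting the parent's memory, so the pool is created under an if __name__ == "__main__": guard and each worker receives its sub-frame as an argument. The names calc_dist_chunk and haversine_km are hypothetical, the coordinates are made up, and a plain haversine formula on a sphere of radius 6371 km stands in for vincenty so the example has no extra dependency; its distances will differ slightly from ellipsoidal ones.

```python
import math
import multiprocessing as mp
from itertools import combinations

import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere (stand-in for vincenty)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def calc_dist_chunk(arg):
    """Hypothetical variant of calc_dist2: receives (group name, sub-frame)."""
    grp, chunk = arg
    rows = [
        [grp, a.ser_no, b.ser_no, haversine_km(a.lat, a.lon, b.lat, b.lon)]
        for a, b in combinations(chunk.itertuples(index=False), 2)
    ]
    return pd.DataFrame(rows, columns=["co_nm", "machineA", "machineB", "distance_km"])


def main():
    # Made-up test data; the question's real frame is not shown.
    df = pd.DataFrame({
        "co_nm": ["aa", "aa", "bb", "bb", "bb"],
        "ser_no": [1, 2, 3, 4, 5],
        "lat": [10.0, 11.0, 20.0, 21.0, 22.0],
        "lon": [50.0, 50.0, 60.0, 60.0, 60.0],
    })
    # Iterating a groupby yields (group name, sub-frame) tuples.
    chunks = list(df.groupby("co_nm"))
    with mp.Pool(processes=2) as pool:
        results = pool.map(calc_dist_chunk, chunks)
    print(pd.concat(results, ignore_index=True))


if __name__ == "__main__":
    main()
```

    The trade-off is that every sub-frame is pickled and sent to a worker, which costs more than relying on forked memory but does not depend on the start method.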
