Parallelizing comparisons between two dataframes with multiprocessing


Question


I've got the following function that compares each row of one dataframe (data) against the rows of another (ref) and returns the indices of the matching ref rows, if any.

def get_gene(row):
    # Match when column 0 is equal and the data interval (columns 2-3)
    # lies inside the ref row's interval.
    m = (np.equal(row[0], ref.iloc[:, 0].values)
         & np.greater_equal(row[2], ref.iloc[:, 2].values)
         & np.less_equal(row[3], ref.iloc[:, 3].values))
    return ref.index[m] if m.any() else None
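
For context, I apply it row by row over data, along these lines (the name matches is just a placeholder):

matches = data.apply(get_gene, axis=1)  # one ref Index (or None) per row of data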

This is a slow process (25 min for 1.6M rows in data versus 20K rows in ref), so I tried to speed things up by parallelizing the computation. As pandas doesn't support multiprocessing natively, I used the following piece of code, which I found on SO, and it worked fine with my function get_gene.

def _apply_df(args):
    # Unpack the (dataframe chunk, function, kwargs) tuple passed by pool.map().
    df, func, kwargs = args
    return df.apply(func, **kwargs)


def apply_by_multiprocessing(df, func, **kwargs):
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)

    # Split df into one chunk per worker and apply func to each chunk in parallel.
    result = pool.map(_apply_df, [(d, func, kwargs) for d in np.array_split(df, workers)])

    pool.close()

    return pd.concat(result)

It brought the computation down to 9 min. But, if I understood correctly, this code just breaks my dataframe data into 4 pieces and sends one to each core of the CPU. Hence, each core ends up comparing 400K rows (from data split in 4) against all 20K rows of ref.
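
For reference, this is how I call the wrapper (the workers value is just an example for a 4-core machine):

matches = apply_by_multiprocessing(data, get_gene, axis=1, workers=4)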

What I would actually like to do is split both dataframes based on a value in one of their columns, so that I only compare dataframes belonging to the same 'group':

  • data.get_group(['a']) versus ref.get_group(['a'])

  • data.get_group(['b']) versus ref.get_group(['b'])

  • data.get_group(['c']) versus ref.get_group(['c'])

  • etc...

which would reduce the amount of computation. Each row in data would then only be matched against ~3K rows of ref, instead of all 20K rows.
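
Schematically, I'm after something like this (a rough sketch, assuming the grouping column is called 'Chr' as in my data):

data_groups = data.groupby('Chr')
ref_groups = ref.groupby('Chr')
# One (data group, ref group) pair per key present in both dataframes.
pairs = [(data_groups.get_group(k), ref_groups.get_group(k))
         for k in data_groups.groups.keys() & ref_groups.groups.keys()]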

Therefore, I tried to modify the code above but I couldn't manage to make it work.

def apply_get_gene(df, func, **kwargs):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    reference = reference.groupby(['Chr'])

    df = df.groupby(['Chr'])
    chromosome = df.groups.keys()

    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)

    # Each task now carries four items: a data group, the function,
    # the kwargs, and the matching ref group.
    args_list = [(df.get_group(chrom), func, kwargs, reference.get_group(chrom)) for chrom in chromosome]
    results = pool.map(_apply_df, args_list)

    pool.close()
    pool.join()

    return pd.concat(results)


def _apply_df(args):
    df, func, kwarg1, kwarg2 = args
    return df.apply(func, **kwargs)


def get_gene(row, ref):
    m = (np.equal(row[0], ref.iloc[:, 0].values)
         & np.greater_equal(row[2], ref.iloc[:, 2].values)
         & np.less_equal(row[3], ref.iloc[:, 3].values))
    return ref.index[m] if m.any() else None

I'm pretty sure it has to do with the way *args and **kwargs are passed through the different functions (since in this case I have to pass each split of ref along with the corresponding split of data). I think the problem lies within the function _apply_df. I thought I understood what it does, but the line df, func, kwargs = args is still bugging me, and I think I failed to modify it correctly.
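
As far as I understand it, the original worker just unpacks a three-element tuple positionally, something like this (hypothetical values):

args = (df_chunk, get_gene, {'axis': 1})  # hypothetical task tuple
df, func, kwargs = args                   # positional unpacking
df.apply(func, **kwargs)                  # i.e. df_chunk.apply(get_gene, axis=1)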

Any advice is appreciated!


Answer 1:


Take a look at starmap():

starmap(func, iterable[, chunksize]) Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.

Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].

Which seems to be exactly what you need.
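
A minimal, self-contained illustration:

import multiprocessing

def add(x, y):
    return x + y

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2) as pool:
        # Each tuple is unpacked into add(x, y).
        print(pool.starmap(add, [(1, 2), (3, 4)]))  # prints [3, 7]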




Answer 2:


I'm posting the answer I came up with for readers who might stumble upon this post:

As noted by @Michele Tonutti, I just had to use starmap() and do a bit of tweaking here and there. The tradeoff is that it only applies my custom function get_gene with the setting axis=1, but there's probably a way to make it more flexible if needed.

import multiprocessing

import numpy as np
import pandas as pd


def Detect_gene(data):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    ref = reference.groupby(['Chr'])

    # Split data into one group per chromosome.
    df = data.groupby(['Chr'])
    chromosome = df.groups.keys()

    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)

    # One task per chromosome: the data group and its matching ref group.
    args = [(df.get_group(chrom), ref.get_group(chrom))
            for chrom in chromosome]

    # starmap() unpacks each tuple into apply_get_gene(df, a).
    results = pool.starmap(apply_get_gene, args)

    pool.close()
    pool.join()

    return pd.concat(results)


def apply_get_gene(df, a):
    return df.apply(get_gene, axis=1, ref=a)


def get_gene(row, ref):
    # Match when column 0 is equal and the data interval (columns 2-3)
    # lies inside the ref row's interval.
    m = (np.equal(row[0], ref.iloc[:, 0].values)
         & np.greater_equal(row[2], ref.iloc[:, 2].values)
         & np.less_equal(row[3], ref.iloc[:, 3].values))
    return ref.index[m] if m.any() else None

It now takes ~5 min, instead of ~9 min with the former version of the code and ~25 min without multiprocessing.
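
One caveat: since the pool is created at call time, the call should sit under a main guard on platforms that use the spawn start method (e.g. Windows). A hypothetical driver, assuming the input lives in a file called data.csv:

if __name__ == '__main__':
    data = pd.read_csv('data.csv', index_col=0)  # hypothetical input file
    print(Detect_gene(data))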



Source: https://stackoverflow.com/questions/51948034/parallelizing-comparisons-between-two-dataframes-with-multiprocessing
