How to apply a function to mulitple columns of a pandas DataFrame in parallel

╄→гoц情女王★ 提交于 2019-12-23 05:26:30

问题


I have a pandas DataFrame with hundreds of thousands of rows, and I want to apply a time-consuming function on multiple columns of that DataFrame in parallel.

I know how to apply the function serially. For example:

import hashlib

import pandas as pd


df = pd.DataFrame(
    {'col1': range(100_000), 'col2': range(100_000, 200_000)},
    columns=['col1', 'col2'])


def foo(col1, col2):
    # This function is actually much more time consuming in real life
    return hashlib.md5(f'{col1}-{col2}'.encode('utf-8')).hexdigest()


df['md5'] = df.apply(lambda row: foo(row.col1, row.col2), axis=1)

df.head()
# Out[5]: 
#    col1    col2                               md5
# 0     0  100000  92e2a2c7a6b7e3ee70a1c5a5f2eafd13
# 1     1  100001  01d14f5020a8ba2715cbad51fd4c503d
# 2     2  100002  c0e01b86d0a219cd71d43c3cc074e323
# 3     3  100003  d94e31d899d51bc00512938fc190d4f6
# 4     4  100004  7710d81dc7ded13326530df02f8f8300

But how would I apply function foo parallel, utilizing all available cores on my machine?


回答1:


The easiest way to do this is using concurrent.futures.

import concurrent.futures

with concurrent.futures.ProcessPoolExecutor(16) as pool:
    df['md5'] = list(pool.map(foo, df['col1'], df['col2'], chunksize=1_000))

df.head()
# Out[10]: 
#    col1    col2                               md5
# 0     0  100000  92e2a2c7a6b7e3ee70a1c5a5f2eafd13
# 1     1  100001  01d14f5020a8ba2715cbad51fd4c503d
# 2     2  100002  c0e01b86d0a219cd71d43c3cc074e323
# 3     3  100003  d94e31d899d51bc00512938fc190d4f6
# 4     4  100004  7710d81dc7ded13326530df02f8f8300

Specifying chunksize=1_000 makes this run faster because each process will process 1000 rows at a time (i.e. you will pay the overhead of initializing a process only once per 1000 rows).

Note that this will only work in Python 3.2 or newer.



来源:https://stackoverflow.com/questions/47718865/how-to-apply-a-function-to-mulitple-columns-of-a-pandas-dataframe-in-parallel

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!