Parallel programming approach to solve pandas problems

Anonymous (unverified), submitted 2019-12-03 01:36:02

Question:

I have a dataframe of the following format.
df

A   B   Target
5   4   3
1   3   4

I am computing the correlation of each column (except Target) with the Target column using pd.DataFrame(df.corr().iloc[:-1,-1]).
The problem is that my actual dataframe has shape (216, 72391), which takes at least 30 minutes to process on my system. Is there any way to parallelize it using a GPU? I need to compute values of this kind many times, so I can't wait the normal processing time of 30 minutes each time.

Answer 1:

Here, I have tried to implement your operation using numba:

import numpy as np
import pandas as pd
from numba import jit, prange

# ---------- You can ignore the code starting from here ----------
# Create a random DataFrame with 72391 feature columns and 300 rows
df_dict = {}
for i in range(0, 72391):
    df_dict[i] = np.random.randint(100, size=300)
df = pd.DataFrame(df_dict)
df['target'] = np.random.randint(100, size=300)
# ---------- Ignore code till here. This is just to generate dummy data ----------

# Assume df is your original DataFrame
target_array = df['target'].values

# You can choose to restore this column later.
# For now we remove it, since we will call df.values
# and find the correlation of each column with the target
df.drop(['target'], inplace=True, axis=1)

# This function takes a 2D NumPy array and a target array as input.
# The 2D array holds the transposed data, i.e. its shape is (72391, 300),
# so each row is one column of df; the target array's shape is (300,)
def do_stuff(df_values, target_arr):
    # Allocate an array to store the results
    # df_values.shape[0] = 72391, equal to the number of columns in df
    result = np.empty(df_values.shape[0])

    # Iterate over each column; prange lets numba run the loop in parallel
    # (with a plain range, parallel=True would not parallelize this loop)
    for i in prange(df_values.shape[0]):
        # Correlation of one column with the target column
        result[i] = np.corrcoef(df_values[i], target_arr)[0, 1]
    return result

# Decorate the function do_stuff
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)

# This contains all the correlations.
# Pass the transposed values once (shape (72391, 300)), as float64
result_array = do_stuff_numba(
    np.ascontiguousarray(df.T.values, dtype=np.float64),
    target_array.astype(np.float64),
)

Link to colab notebook.
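As a possible alternative to the explicit loop (this is not part of the original answer), the same per-column Pearson correlation can be sketched with vectorized NumPy. It avoids building the full 72391 x 72391 matrix that df.corr() computes before the last column is sliced out. The function name column_target_corr and the dummy data below are assumptions made for illustration, and the sketch assumes no column is constant (a constant column would give a zero denominator).

```python
import numpy as np
import pandas as pd


def column_target_corr(values, target):
    """Pearson correlation of every column of `values` with `target`.

    Hypothetical helper for illustration; assumes no constant columns.
    """
    values = np.asarray(values, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Center each column and the target around their means
    vc = values - values.mean(axis=0)
    tc = target - target.mean()
    # Numerator: sum of products of deviations, one value per column
    num = vc.T @ tc
    # Denominator: product of the two root sums of squared deviations
    den = np.sqrt((vc ** 2).sum(axis=0) * (tc ** 2).sum())
    return num / den


# Dummy data: 216 rows, 1000 feature columns plus a Target column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(216, 1000)).astype(float))
df["Target"] = rng.integers(0, 100, size=216).astype(float)

features = df.drop(columns="Target")
fast = column_target_corr(features.to_numpy(), df["Target"].to_numpy())

# Cross-check against pandas' own column-by-column correlation
reference = features.corrwith(df["Target"]).to_numpy()
```

Because the work reduces to one matrix-vector product, it may already be fast enough on a CPU for a (216, 72391) frame. Timings are not measured here.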



Answer 2:

You should take a look at dask. It should be able to do what you want and a lot more. It parallelizes most of the DataFrame functions.


