MemoryError on large merges with pandas in Python

日久生厌 2021-01-17 16:01

I'm using pandas to do an outer merge on a set of about ~1000-2000 CSV files. Each CSV file has an identifier column id which is shared between all
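
For context, a minimal sketch of the setup being described, assuming the files sit in one directory and all share an id column (the data/*.csv pattern and the functools.reduce loop are assumptions, not the asker's actual code):

    import glob
    from functools import reduce

    import pandas as pd

    # Read every CSV file (the path pattern is an assumption).
    frames = [pd.read_csv(path) for path in glob.glob("data/*.csv")]

    # Outer-merge them pairwise on the shared `id` column; with 1000-2000
    # frames this repeated merge is where memory usage can blow up.
    merged = reduce(lambda left, right: pd.merge(left, right, on="id", how="outer"),
                    frames)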

3 Answers
  •  难免孤独
    2021-01-17 16:15

    pd.concat seems to run out of memory for large DataFrames as well. One option is to convert the DataFrames to NumPy arrays and concatenate those.

    from copy import deepcopy

    import numpy as np
    import pandas as pd


    def concat_df_by_np(df1, df2):
        """
        Accepts two DataFrames, converts each to a NumPy array, concatenates
        them horizontally, and reuses the index of the first DataFrame. This is
        not a concat by index but simply by position, so both DataFrames should
        share the same index.
        """
        if (df1.index != df2.index).any():
            # logging.warning('Indices in concat_df_by_np are not the same')
            print('Indices in concat_df_by_np are not the same')

        # .to_numpy() replaces the long-deprecated .as_matrix()
        dfout = deepcopy(pd.DataFrame(np.concatenate((df1.to_numpy(), df2.to_numpy()), axis=1),
                                      index=df1.index,
                                      columns=np.concatenate([df1.columns, df2.columns])))
        return dfout
    

    However, be careful: this function is not a join but a horizontal append in which the indices are ignored.
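
    A short usage sketch (the sample frames are made up; it assumes the concat_df_by_np helper defined above):

    import pandas as pd

    # Hypothetical example data: two DataFrames that share the same index.
    df_a = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
    df_b = pd.DataFrame({"y": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])

    # Because the helper appends by position, align the second frame to the
    # first frame's index before combining.
    df_b = df_b.reindex(df_a.index)

    combined = concat_df_by_np(df_a, df_b)   # columns: x, y; index: a, b, c
    print(combined)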
