How to concatenate multiple pandas.DataFrames without running into MemoryError

Backend · Open · 10 answers · 1348 views
Asked by 盖世英雄少女心 on 2020-12-24 12:19

I have three DataFrames that I'm trying to concatenate.

concat_df = pd.concat([df1, df2, df3])

This results in a MemoryError. How can I resolve this?

10 Answers
  •  伪装坚强ぢ
    2020-12-24 13:07

    The problem is, as noted in the other answers, one of memory. A solution is to store the data on disk, then build a single dataframe.

    With data this large, performance is an issue.

    CSV solutions are very slow, since the conversion happens in text mode. HDF5 solutions are shorter, more elegant and faster, since they work in binary mode. I propose a third way in binary mode, with pickle, which seems to be even faster, but is more technical and needs some more room. And a fourth, by hand.

    Here is the code:

    import os
    import pickle
    
    import numpy as np
    import pandas as pd
    
    # a DataFrame factory:
    dfs = []
    for i in range(10):
        dfs.append(pd.DataFrame(np.empty((10**5, 4)), columns=range(4)))
    
    # a csv solution: write each frame to one file in append mode, then read it back
    def bycsv(dfs):
        md, hd = 'w', True
        for df in dfs:
            df.to_csv('df_all.csv', mode=md, header=hd, index=False)
            md, hd = 'a', False  # append without header from the second frame on
        #del dfs
        df_all = pd.read_csv('df_all.csv', index_col=None)
        os.remove('df_all.csv')
        return df_all
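
    Note the #del dfs hints in each function: once the frames have been written to disk, the sources can be freed before the full frame is read back, keeping the peak near one copy of the data. The price of the CSV route is the text conversion on both the write and the read.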
    

    Better solutions:

    def byHDF(dfs):
        # append each frame to a single HDF5 store in binary mode
        store = pd.HDFStore('df_all.h5')
        for df in dfs:
            store.append('df', df, data_columns=list(range(4)))  # columns are the ints 0-3
        #del dfs
        df = store.select('df')
        store.close()
        os.remove('df_all.h5')
        return df
    
    def bypickle(dfs):
        # dump each frame to one pickle file, remembering its length
        c = []
        with open('df_all.pkl', 'ab') as f:
            for df in dfs:
                pickle.dump(df, f)
                c.append(len(df))
        #del dfs
        with open('df_all.pkl', 'rb') as f:
            # load the first frame, then pre-allocate empty rows for the others
            # (DataFrame.append was removed in pandas 2.0, hence pd.concat)
            df_all = pickle.load(f)
            offset = len(df_all)
            df_all = pd.concat([df_all, pd.DataFrame(np.empty((sum(c[1:]), 4)))])
    
            # fill the pre-allocated rows in place, one pickled frame at a time
            for size in c[1:]:
                df = pickle.load(f)
                df_all.iloc[offset:offset + size] = df.values
                offset += size
        os.remove('df_all.pkl')
        return df_all
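
    Note that byHDF needs the PyTables package (tables) installed, since pandas' HDFStore is built on top of it. The bypickle trick works because at most one extra frame is materialized at a time: the first frame is loaded, the remaining rows are pre-allocated once as an empty block, and each subsequent pickled frame is copied into that block in place.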
    

    For homogeneous dataframes, we can do even better:

    def byhand(dfs):
        # write the raw ndarray bytes of each frame to one binary file
        mtot = 0
        with open('df_all.bin', 'wb') as f:
            for df in dfs:
                m, n = df.shape
                mtot += m
                f.write(df.values.tobytes())
                typ = df.values.dtype
        #del dfs
        # read everything back as a single buffer and rebuild one frame
        with open('df_all.bin', 'rb') as f:
            buffer = f.read()
            data = np.frombuffer(buffer, dtype=typ).reshape(mtot, n)
            df_all = pd.DataFrame(data=data, columns=list(range(n)))
        os.remove('df_all.bin')
        return df_all
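
    byhand assumes homogeneous frames: every frame must share the same dtype and the same number of columns, and the original indexes and column labels are discarded and rebuilt as plain ranges.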
    

    And some tests on small (32 MB) data to compare performance. You have to multiply by about 128 for 4 GB.

    In [92]: %time w=bycsv(dfs)
    Wall time: 8.06 s
    
    In [93]: %time x=byHDF(dfs)
    Wall time: 547 ms
    
    In [94]: %time v=bypickle(dfs)
    Wall time: 219 ms
    
    In [95]: %time y=byhand(dfs)
    Wall time: 109 ms
    

    A check:

    In [195]: (x.values==w.values).all()
    Out[195]: True
    
    In [196]: (x.values==v.values).all()
    Out[196]: True
    
    In [197]: (x.values==y.values).all()
    Out[197]: True
    

    Of course, all of that must be improved and tuned to fit your problem.

    For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' to be able to run bypickle; a sketch follows below.

    I can edit this if you give more information on your data structure and sizes. Beautiful question!
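
    For instance, here is a minimal sketch of that chunking idea (not part of the tested code above; split_df and n_chunks are hypothetical names, and you would derive n_chunks from your memory budget):

    def split_df(df, n_chunks):
        # slice row ranges with iloc; each piece is a smaller DataFrame
        step = -(-len(df) // n_chunks)  # ceiling division
        return [df.iloc[i:i + step] for i in range(0, len(df), step)]
    
    # dfs = [df1, df2] + split_df(df3, 4)
    # df_all = bypickle(dfs)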
