Efficient way to merge multiple large DataFrames

臣服心动 2020-12-10 18:51

Suppose I have 4 small DataFrames

df1, df2, df3 and df4

import pandas as pd
from functools import reduce
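
The four example DataFrames can be read back out of the concat output in the answer below, and the functools import suggests the merge that runs out of memory is a chained pd.merge via reduce. The following is a reconstruction of that setup, not the original code:

df1 = pd.DataFrame({'id': [1, 2, 1, 1, 2], 'name': ['a', 'a', 'b', 'c', 'e'],
                    'pricepart1': [10.0, 20.0, 4.0, 2.0, 10.0]})
df2 = pd.DataFrame({'id': [1, 2, 1], 'name': ['a', 'a', 'c'],
                    'pricepart2': [15.0, 20.0, 2.0]})
df3 = pd.DataFrame({'id': [1, 2, 1], 'name': ['d', 'e', 'f'],
                    'pricepart3': [10.0, 20.0, 1.0]})
df4 = pd.DataFrame({'id': [1, 2, 1], 'name': ['d', 'e', 'f'],
                    'pricepart4': [10.0, 20.0, 15.0]})

# Chained outer merges on the shared keys; with 48 large frames the
# intermediate results grow quickly and can raise MemoryError
merged = reduce(lambda left, right: pd.merge(left, right, on=['id', 'name'], how='outer'),
                [df1, df2, df3, df4])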


        
4 Answers
  •  愿得一人
    2020-12-10 19:30

    So you have 48 dfs, each with 3 columns: name, id, and a price column that is different for every df.

    You don't have to use merge at all.

    Instead, if you concat all the dfs:

    df = pd.concat([df1,df2,df3,df4])
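
    pd.concat keeps each frame's original index, which is why the row labels 0, 1, 2 repeat in the output below. Renumbering with ignore_index is optional (it does not change the groupby result), but it is available if the duplicated index bothers you:

    df = pd.concat([df1, df2, df3, df4], ignore_index=True)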
    

    You will get:

    Out[3]: 
       id name  pricepart1  pricepart2  pricepart3  pricepart4
    0   1    a        10.0         NaN         NaN         NaN
    1   2    a        20.0         NaN         NaN         NaN
    2   1    b         4.0         NaN         NaN         NaN
    3   1    c         2.0         NaN         NaN         NaN
    4   2    e        10.0         NaN         NaN         NaN
    0   1    a         NaN        15.0         NaN         NaN
    1   2    a         NaN        20.0         NaN         NaN
    2   1    c         NaN         2.0         NaN         NaN
    0   1    d         NaN         NaN        10.0         NaN
    1   2    e         NaN         NaN        20.0         NaN
    2   1    f         NaN         NaN         1.0         NaN
    0   1    d         NaN         NaN         NaN        10.0
    1   2    e         NaN         NaN         NaN        20.0
    2   1    f         NaN         NaN         NaN        15.0
    

    Now you can group by name and id and take the sum:

    df.groupby(['name','id']).sum().fillna('missing').reset_index()
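
    (Note: in recent pandas versions, groupby(...).sum() turns all-NaN groups into 0 by default, so the fillna('missing') above may find nothing left to fill. If you want the missing price parts to stay marked, one variation, not part of the original answer, is to pass min_count=1:)

    df.groupby(['name', 'id']).sum(min_count=1).fillna('missing').reset_index()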
    

    If you try this with the 48 dfs, you will see that it avoids the MemoryError:

    import numpy as np

    dfList = []
    # Create the 48 DataFrames of size 62245 x 3
    for i in range(48):
        dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                                   columns=['name', 'id', 'pricepart' + str(i + 1)]))
    
    df = pd.concat(dfList)
    df.groupby(['name','id']).sum().fillna('missing').reset_index()
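
    If memory is still tight even for the single concat, one further option (a general pandas trick, not part of the original answer) is to shrink the dtypes of the integer columns before concatenating. In this synthetic benchmark name and id are small random integers, so they downcast cleanly:

    # Downcast the integer columns of each frame before concatenating
    for d in dfList:
        for col in ['name', 'id']:
            d[col] = pd.to_numeric(d[col], downcast='integer')

    df = pd.concat(dfList)
    df.groupby(['name', 'id']).sum().fillna('missing').reset_index()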
    
