Efficient way to merge multiple large DataFrames

Asked 2020-12-10 18:51

Suppose I have 4 small DataFrames

df1, df2, df3 and df4

import pandas as pd
from functools import reduce
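
Judging from the answers below, the rest of the setup was presumably along these lines (the sample data is taken from the second answer, the column names and real-data sizes from the third, and the reduce-based merge from the fourth; treat it as a reconstruction rather than the original body):

df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]],
                   columns=['name', 'id', 'pricepart1'])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]],
                   columns=['name', 'id', 'pricepart2'])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]],
                   columns=['name', 'id', 'pricepart3'])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]],
                   columns=['name', 'id', 'pricepart4'])

# Chained outer merge on the shared keys; with the real data
# (48 DataFrames of 62245 rows each) this runs out of memory.
df = reduce(lambda left, right: pd.merge(left, right, on=['name', 'id'], how='outer'),
            [df1, df2, df3, df4])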


        
4 Answers
  • 2020-12-10 19:21

    Seems like part of what Dask DataFrames were designed to do (out-of-memory operations on dataframes). See Best way to join two large datasets in Pandas for example code. I'm not copying and pasting the code here because I don't want to take credit from the answerer in the linked entry.
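
    For reference, a minimal sketch of that approach (it assumes dask is installed and that df1 through df4 carry the name/id/pricepart columns from the question; the partition count is arbitrary):

    import dask.dataframe as dd
    from functools import reduce

    # Wrap each pandas DataFrame in a lazy, partitioned Dask DataFrame.
    ddfs = [dd.from_pandas(df, npartitions=4) for df in [df1, df2, df3, df4]]

    # Outer-merge on the shared keys; nothing is computed yet.
    merged = reduce(lambda left, right: dd.merge(left, right, on=['name', 'id'], how='outer'), ddfs)

    # Trigger the computation, which proceeds partition by partition.
    result = merged.compute()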

  • 2020-12-10 19:26

    You can try a simple for loop. The only memory optimization applied here is downcasting to the most compact integer type via pd.to_numeric.

    I am also using a dictionary to store the dataframes. This is good practice when you are holding a variable number of related objects.

    import pandas as pd

    # A dictionary keyed by integer scales to any number of DataFrames.
    dfs = {}
    dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
    dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
    dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
    dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])

    df = dfs[1].copy()

    # Merge in one frame at a time, renaming each value column (label 2)
    # to a unique label so the columns do not collide.
    for i in range(2, max(dfs) + 1):
        df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                      left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)
        # Downcast the value columns to the smallest integer type that fits.
        df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')

    print(df)
    
       0  1   2   3   4   5
    0  a  1  10  15  -1  -1
    1  a  2  20  20  -1  -1
    2  b  1   4  -1  -1  -1
    3  c  1   2   2  -1  -1
    4  e  2  10  -1  20  20
    5  d  1  -1  -1  10  10
    6  f  1  -1  -1   1  15
    

    You should not, as a rule, combine strings such as 'missing' with numeric types, as this will turn your entire series into an object-dtype series. Here we use -1, but you may wish to use NaN with float dtype instead.
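
    A quick illustration of the dtype pitfall:

    import pandas as pd

    s = pd.Series([1, 2, None])
    print(s.dtype)                    # float64: NaN keeps the column numeric
    print(s.fillna('missing').dtype)  # object: every element becomes a Python object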

  • 2020-12-10 19:30

    So, you have 48 dfs with 3 columns each: name, id, and a different value column for every df.

    You don't have to use merge at all.

    Instead, if you concat all the dfs

    df = pd.concat([df1,df2,df3,df4])
    

    You will receive:

    Out[3]: 
       id name  pricepart1  pricepart2  pricepart3  pricepart4
    0   1    a        10.0         NaN         NaN         NaN
    1   2    a        20.0         NaN         NaN         NaN
    2   1    b         4.0         NaN         NaN         NaN
    3   1    c         2.0         NaN         NaN         NaN
    4   2    e        10.0         NaN         NaN         NaN
    0   1    a         NaN        15.0         NaN         NaN
    1   2    a         NaN        20.0         NaN         NaN
    2   1    c         NaN         2.0         NaN         NaN
    0   1    d         NaN         NaN        10.0         NaN
    1   2    e         NaN         NaN        20.0         NaN
    2   1    f         NaN         NaN         1.0         NaN
    0   1    d         NaN         NaN         NaN        10.0
    1   2    e         NaN         NaN         NaN        20.0
    2   1    f         NaN         NaN         NaN        15.0
    

    Now you can group by name and id and take the sum:

    df.groupby(['name','id']).sum().fillna('missing').reset_index()
    

    If you try it with the 48 dfs you will see that it solves the MemoryError:

    import numpy as np

    dfList = []
    # Create the 48 DataFrames of size 62245 x 3
    for i in range(48):
        dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                                   columns=['name', 'id', 'pricepart' + str(i + 1)]))

    df = pd.concat(dfList)
    df.groupby(['name', 'id']).sum().fillna('missing').reset_index()
    
  • 2020-12-10 19:37

    You may get some benefit from performing an index-aligned concatenation with pd.concat. This should hopefully be faster and more memory-efficient than an outer merge as well.

    df_list = [df1, df2, ...]  # all of your DataFrames

    # Move the join keys into the index so concatenation aligns on them.
    for df in df_list:
        df.set_index(['name', 'id'], inplace=True)

    # Align on the shared index; pass join='inner' to keep only common keys.
    df = pd.concat(df_list, axis=1)
    df.reset_index(inplace=True)
    

    Alternatively, you can replace the concat (the second step) with an iterative join:

    from functools import reduce

    # Each join aligns on the ('name', 'id') index set in the previous step.
    df = reduce(lambda x, y: x.join(y), df_list)
    

    This may or may not be better than the merge.
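
    If you want to know which variant wins on your data, a quick harness along these lines can settle it empirically (a sketch; it assumes df_list has already been indexed on ['name', 'id'] as above):

    import timeit
    from functools import reduce
    import pandas as pd

    # Time each index-aligned strategy a few runs and compare.
    concat_time = timeit.timeit(lambda: pd.concat(df_list, axis=1), number=3)
    join_time = timeit.timeit(lambda: reduce(lambda x, y: x.join(y), df_list), number=3)
    print(f'concat: {concat_time:.2f}s, join: {join_time:.2f}s')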
