Efficient way to merge multiple large DataFrames

臣服心动 2020-12-10 18:51

Suppose I have 4 small DataFrames

df1, df2, df3 and df4

import pandas as pd
from functools import         


        
4 Answers
  •  离开以前
    2020-12-10 19:26

    You can try a simple for loop that merges the frames one at a time. The only memory optimization I have applied is downcasting each value column to the smallest integer type that fits, via pd.to_numeric.

    I am also using a dictionary to store the DataFrames. This is good practice when the number of DataFrames is not fixed in advance.

    import pandas as pd
    
    dfs = {}
    dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
    dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
    dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
    dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   
    
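    # Start from the first frame and fold the rest in, one outer merge at a time.
    df = dfs[1].copy()
    
    for i in range(2, max(dfs)+1):
        # Rename the value column (2) so each frame keeps its own column.
        df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                      left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)
        # Reassign whole columns; assigning through iloc may keep the old
        # float dtype on recent pandas versions.
        value_cols = df.columns[2:]
        df[value_cols] = df[value_cols].apply(pd.to_numeric, downcast='integer')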
    
    print(df)
    
       0  1   2   3   4   5
    0  a  1  10  15  -1  -1
    1  a  2  20  20  -1  -1
    2  b  1   4  -1  -1  -1
    3  c  1   2   2  -1  -1
    4  e  2  10  -1  20  20
    5  d  1  -1  -1  10  10
    6  f  1  -1  -1   1  15
    

    As a rule, you should not fill numeric columns with strings such as "missing", because a single string turns the entire series into object dtype, which costs memory and rules out fast numeric operations. Here we use -1 as the placeholder, but you may wish to keep NaN with a float dtype instead.
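
    A minimal sketch of the dtype trade-off described above. The series s and the names as_text, as_int and as_float are made up for illustration; they stand in for one value column after an outer merge and are not part of the answer's code.

```python
import numpy as np
import pandas as pd

# Hypothetical value column with one gap, as an outer merge would produce.
s = pd.Series([10.0, np.nan, 4.0])

# Option 1: a string placeholder forces the whole series to object dtype.
as_text = s.fillna("missing")

# Option 2: a numeric sentinel (-1) allows downcasting to a small integer type.
as_int = pd.to_numeric(s.fillna(-1), downcast="integer")

# Option 3: keep NaN and downcast within the float family instead.
as_float = pd.to_numeric(s, downcast="float")

for name, col in [("text", as_text), ("int", as_int), ("float", as_float)]:
    print(name, col.dtype, col.memory_usage(index=False, deep=True), "bytes")
```

    The sentinel is only safe if -1 can never be a real value in the data; when it can, keeping NaN avoids confusing a missing entry with a measurement.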
