Suppose I have 4 small DataFrames df1, df2, df3 and df4
import pandas as pd
from functools import
You can try a simple for loop. The only memory optimization I have applied is downcasting to the smallest suitable integer type via pd.to_numeric.
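I am also using a dictionary to store the DataFrames. This is good practice when the number of objects is variable, since you can loop over the keys instead of maintaining separate names such as df1, df2, and so on.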
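import pandas as pd

# Keep the frames in a dict keyed by position rather than in separate names.
dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])

df = dfs[1].copy()

# Outer-merge each remaining frame on key columns 0 and 1. Value column 2 is
# renamed to i+1 so that every frame contributes its own column.
for i in range(2, max(dfs) + 1):
    df = pd.merge(df, dfs[i].rename(columns={2: i + 1}),
                  on=[0, 1], how='outer').fillna(-1)

# Downcast the value columns to the smallest integer dtype that fits.
value_cols = df.columns[2:]
df[value_cols] = df[value_cols].apply(pd.to_numeric, downcast='integer')

print(df)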
   0  1   2   3   4   5
0  a  1  10  15  -1  -1
1  a  2  20  20  -1  -1
2  b  1   4  -1  -1  -1
3  c  1   2   2  -1  -1
4  e  2  10  -1  20  20
5  d  1  -1  -1  10  10
6  f  1  -1  -1   1  15
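To see the downcasting step in isolation, here is a minimal sketch. The Series `s` is a made-up stand-in for one value column after an outer merge and fillna(-1): whole numbers that pandas has stored as float64.

```python
import pandas as pd

# Hypothetical column: whole-number values held as float64,
# as typically happens after an outer merge followed by fillna(-1).
s = pd.Series([10.0, 20.0, -1.0, 4.0])
print(s.dtype)        # float64

# downcast='integer' picks the smallest signed integer dtype that can
# represent every value without loss.
small = pd.to_numeric(s, downcast='integer')
print(small.dtype)    # int8

# Bytes used by the values alone, before and after.
print(s.to_numpy().nbytes, small.to_numpy().nbytes)
```

If a column contains a value that is not a whole number, the downcast is skipped and the column stays float.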
You should not, as a rule, combine strings such as "missing" with numeric types, as this will turn your entire series into an object dtype series. Here we use -1, but you may wish to use NaN with float dtype instead.
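A small illustration of the dtype point, using three made-up Series that differ only in how the missing value is represented:

```python
import numpy as np
import pandas as pd

# Illustrative data only: the same column with three choices of placeholder.
mixed = pd.Series([10, "missing", 4])    # string placeholder
sentinel = pd.Series([10, -1, 4])        # integer sentinel, as in the answer
with_nan = pd.Series([10, np.nan, 4])    # NaN placeholder

print(mixed.dtype)       # object
print(sentinel.dtype)    # an integer dtype
print(with_nan.dtype)    # float64

# NaN is skipped by default in reductions; the -1 sentinel is not.
print(with_nan.mean())   # 7.0
print(sentinel.mean())
```

The trade-off is that -1 keeps a compact integer dtype but behaves like real data in calculations, whereas NaN costs a float column but is ignored by most aggregations.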