Suppose I have 4 small DataFrames
df1
, df2
, df3
and df4
import pandas as pd
from functools import
Seems like part of what dask dataframes were designed to do (out of memory ops with dataframes). See Best way to join two large datasets in Pandas for example code. Sorry not copying and pasting but don't want to seem like I am trying to take credit from answerer in linked entry.
You can try a simple for
loop. The only memory optimization I have applied is downcasting to most optimal int
type via pd.to_numeric
.
I am also using a dictionary to store dataframes. This is good practice for holding a variable number of variables.
import pandas as pd
dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df = dfs[1].copy()
for i in range(2, max(dfs)+1):
df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)
df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')
print(df)
0 1 2 3 4 5
0 a 1 10 15 -1 -1
1 a 2 20 20 -1 -1
2 b 1 4 -1 -1 -1
3 c 1 2 2 -1 -1
4 e 2 10 -1 20 20
5 d 1 -1 -1 10 10
6 f 1 -1 -1 1 15
You should not, as a rule, combine strings such as "missing" with numeric types, as this will turn your entire series into object
type series. Here we use -1
, but you may wish to use NaN
with float
dtype instead.
So, you have 48 dfs with 3 columns each - name, id, and different column for every df.
You don`t must to use merge....
Instead, if you concat all the dfs
df = pd.concat([df1,df2,df3,df4])
You will recieve:
Out[3]:
id name pricepart1 pricepart2 pricepart3 pricepart4
0 1 a 10.0 NaN NaN NaN
1 2 a 20.0 NaN NaN NaN
2 1 b 4.0 NaN NaN NaN
3 1 c 2.0 NaN NaN NaN
4 2 e 10.0 NaN NaN NaN
0 1 a NaN 15.0 NaN NaN
1 2 a NaN 20.0 NaN NaN
2 1 c NaN 2.0 NaN NaN
0 1 d NaN NaN 10.0 NaN
1 2 e NaN NaN 20.0 NaN
2 1 f NaN NaN 1.0 NaN
0 1 d NaN NaN NaN 10.0
1 2 e NaN NaN NaN 20.0
2 1 f NaN NaN NaN 15.0
Now you can group by name and id and take the sum:
df.groupby(['name','id']).sum().fillna('missing').reset_index()
If you will try it with the 48 dfs you will see it solves the MemoryError:
dfList = []
#To create the 48 DataFrames of size 62245 X 3
for i in range(0, 49):
dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name', 'id', 'pricepart' + str(i + 1)]))
df = pd.concat(dfList)
df.groupby(['name','id']).sum().fillna('missing').reset_index()
You may get some benefit from performing index-aligned concatenation using pd.concat
. This should hopefully be faster and more memory efficient than an outer merge as well.
df_list = [df1, df2, ...]
for df in df_list:
df.set_index(['name', 'id'], inplace=True)
df = pd.concat(df_list, axis=1) # join='inner'
df.reset_index(inplace=True)
Alternatively, you can replace the concat
(second step) by an iterative join
:
from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)
This may or may not be better than the merge
.