How to perform a merge of (too) large dataframes?

Submitted by 痴心易碎 on 2021-01-29 06:43:41

Question


I'm trying to merge a couple of dataframes from the Home Credit Kaggle competition according to the data schema. I did the following:

import pandas as pd

# Load the three tables from the competition data
train = pd.read_csv('~/Documents/HomeCredit/application_train.csv')
bureau = pd.read_csv('~/Documents/HomeCredit/bureau.csv')
bureau_balance = pd.read_csv('~/Documents/HomeCredit/bureau_balance.csv')

# Attach bureau records to applications, then monthly balances to bureau records
train = train.merge(bureau, how='outer', on='SK_ID_CURR')
train = train.merge(bureau_balance, how='inner', on='SK_ID_BUREAU')

which fails with a

MemoryError

on the second merge. The train dataframe has shape (308k, 122), bureau (1.72M, 12) and bureau_balance (27.3M, 3). My understanding is that an application in the train dataframe does not have to have a record in the bureau table, but every row of that table should have a record in bureau_balance.

I'm running the code on my local machine with 16 GB of RAM.

Is there a way to work around the memory issue with such a large dataset?

Thanks in advance.


Answer 1:


Beyond a certain problem size, pandas is no longer the appropriate tool. A back-of-the-envelope estimate: your second merge produces roughly 27.3M rows by ~135 columns, which at 8 bytes per float64 cell is on the order of 29 GB, well beyond 16 GB of RAM. I would import the data into a relational database and issue SQL queries. SQLAlchemy is a nice Python tool for working with databases.
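As a minimal sketch of that approach, assuming SQLite as the backend (the database path, the chunk sizes, and the process() helper below are illustrative placeholders, not part of the original answer):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical on-disk SQLite database; any SQLAlchemy-supported backend works.
engine = create_engine('sqlite:////tmp/homecredit.db')

# Stream each CSV into its own table so no file has to fit in RAM at once.
for name in ['application_train', 'bureau', 'bureau_balance']:
    for chunk in pd.read_csv(f'~/Documents/HomeCredit/{name}.csv',
                             chunksize=100_000):
        chunk.to_sql(name, engine, if_exists='append', index=False)

# Let the database perform the joins and read the result back in chunks.
# A LEFT JOIN only approximates the question's outer merge (bureau rows with
# no matching application are dropped here); use a FULL OUTER JOIN if your
# backend supports it and you need those rows. In practice you would also
# alias the overlapping key columns instead of selecting b.* and bb.*.
query = """
    SELECT t.*, b.*, bb.*
    FROM application_train AS t
    LEFT JOIN bureau AS b ON t.SK_ID_CURR = b.SK_ID_CURR
    JOIN bureau_balance AS bb ON b.SK_ID_BUREAU = bb.SK_ID_BUREAU
"""
for chunk in pd.read_sql(query, engine, chunksize=100_000):
    process(chunk)  # placeholder for whatever aggregation you need

This keeps peak memory bounded by the chunk size rather than by the full join result; the trade-off is the one-time cost of loading the CSVs into the database.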



Source: https://stackoverflow.com/questions/55121153/how-to-perform-a-merge-of-too-large-dataframes
