Optimal chunksize parameter in pandas.DataFrame.to_sql

Submitted by 自作多情 on 2021-01-21 06:34:07

Question


I'm working with a large pandas DataFrame that needs to be dumped into a PostgreSQL table. From what I've read, it's not a good idea to dump it all at once (doing so was locking up the db); instead, use the chunksize parameter. The answers here are helpful for workflow, but I'm just asking about how the value of chunksize affects performance.

In [5]: df.shape
Out[5]: (24594591, 4)

In [6]: df.to_sql('existing_table',
                  con=engine, 
                  index=False, 
                  if_exists='append', 
                  chunksize=10000)

Is there a recommended default, and is there a difference in performance when setting the parameter higher or lower? Assuming I have the memory to support a larger chunksize, will it execute faster?
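The optimal value depends on the table, the driver, and the network, so one practical approach is to time a few candidate values on a slice of the data before committing to one. A minimal sketch, assuming df and engine are defined as above and using a throwaway staging table named chunk_test:

import time

sample = df.head(500_000)  # benchmark on a slice, not all 24.5M rows

for size in (1_000, 10_000, 100_000):
    start = time.perf_counter()
    sample.to_sql('chunk_test', con=engine, index=False,
                  if_exists='replace', chunksize=size)
    print(f'chunksize={size}: {time.perf_counter() - start:.1f}s')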


Answer 1:


I tried something the other way around, going from SQL to CSV, and I noticed that the smaller the chunksize, the quicker the job was done. Adding additional CPUs to the job (multiprocessing) didn't change anything.
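For reference, a chunked SQL-to-CSV export along those lines might look like the sketch below; the connection string, query, and file name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; substitute your own.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# With chunksize set, read_sql returns an iterator of DataFrames,
# so the full result set never has to fit in memory at once.
chunks = pd.read_sql('SELECT * FROM existing_table', con=engine, chunksize=5_000)

for i, chunk in enumerate(chunks):
    # Write the header once, then append subsequent chunks.
    chunk.to_csv('export.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)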




Answer 2:


In my case, 3M rows with 5 columns were inserted in 8 minutes when I called pandas to_sql with chunksize=5000 and method='multi'. This was a huge improvement, as inserting 3M rows from Python into the database had become very hard for me.
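In code, that configuration might look like the following sketch; the connection string and table name are placeholders, and with method='multi' pandas packs many rows into a single multi-row INSERT statement instead of issuing one INSERT per row:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; substitute your own.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# Stand-in for the real 3M-row, 5-column DataFrame.
df = pd.DataFrame({'a': range(10), 'b': range(10)})

# method='multi' batches rows into multi-row INSERT statements;
# chunksize caps how many rows go into each statement.
df.to_sql('my_table',
          con=engine,
          index=False,
          if_exists='append',
          chunksize=5000,
          method='multi')

One caveat: with method='multi', every value in a chunk becomes a bind parameter of a single statement, so very large chunksize values can exceed the database driver's parameter limit.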



Source: https://stackoverflow.com/questions/35202981/optimal-chunksize-parameter-in-pandas-dataframe-to-sql
