Accelerate bulk insert using Django's ORM?

暖寄归人 · 2020-11-30 18:57

I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using Django's ORM. Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.

7 Answers
  • 2020-11-30 19:11

    Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.

    The other snippet wasn't as simple and was really focused on bulk inserts for relationships; this one is just a plain bulk insert that reuses a single INSERT query. A sketch of the idea follows.
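    A minimal sketch of the custom-manager idea (this is not the snippet's actual code; insert_many and its signature are made up for illustration):

    from django.db import connection, models

    class BulkInsertManager(models.Manager):
        def insert_many(self, rows, fields):
            # rows: iterable of tuples whose order matches `fields`
            table = self.model._meta.db_table
            columns = ", ".join(fields)
            placeholders = ", ".join(["%s"] * len(fields))
            sql = "INSERT INTO %s (%s) VALUES (%s)" % (table, columns, placeholders)
            cursor = connection.cursor()
            cursor.executemany(sql, rows)

    Attach it to a model as objects = BulkInsertManager() and call MyModel.objects.insert_many(rows, ["field1", "field2"]).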

  • 2020-11-30 19:13

    Django 1.4 provides a bulk_create() method on the QuerySet object; see:

    • https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
    • https://docs.djangoproject.com/en/dev/releases/1.4/
    • https://code.djangoproject.com/ticket/7596
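
    A minimal sketch of the call, assuming a hypothetical Entry model with two fields:

    from myapp.models import Entry  # hypothetical app and model

    def load_rows(rows):
        # rows: iterable of (field1, field2) tuples parsed from a file
        Entry.objects.bulk_create(
            [Entry(field1=a, field2=b) for a, b in rows]
        )

    On very large inputs you may need to chunk the list yourself; later Django versions add a batch_size argument for this.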
  • 2020-11-30 19:15

    The development version of Django has bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create

  • 2020-11-30 19:17

    This is not specific to the Django ORM, but I recently had to bulk insert >60 million rows of 8-column data from over 2000 files into a sqlite3 database, and I learned that the following three things cut the insert time from over 48 hours to ~1 hour:

    1. increase your DB's cache size setting to use more RAM (the defaults are always very small; I used 3GB); in sqlite this is done with PRAGMA cache_size = n_of_pages (see the sketch after this list);

    2. do journalling in RAM instead of on disk (this does risk losing the in-flight transaction if the system fails, which I consider negligible given that you have the source data on disk already); in sqlite this is done with PRAGMA journal_mode = MEMORY;

    3. last and perhaps most important: do not build indexes while inserting. This also means not declaring UNIQUE or other constraints that make the DB build an index. Build indexes only after you are done inserting.
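
    A minimal sketch of points 1 and 2 using Python's sqlite3 module (the table and file names are made up; note that a positive cache_size is counted in pages, whose byte size depends on PRAGMA page_size):

    import sqlite3

    conn = sqlite3.connect("bulk.db")
    # 1. Enlarge the page cache (value is a number of pages, not bytes).
    conn.execute("PRAGMA cache_size = 1000000")
    # 2. Keep the rollback journal in RAM instead of on disk.
    conn.execute("PRAGMA journal_mode = MEMORY")
    # 3. No indexes or UNIQUE constraints yet; run CREATE INDEX after the inserts.
    conn.execute("CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)")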

    As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:

    cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
    

    The iterable_data can be a list or any other iterable of parameter tuples, even a generator that parses rows out of an open file, as in the sketch below.
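
    For example, a sketch that streams rows out of a CSV file without loading it all into memory (the three-column layout is assumed):

    import csv
    import sqlite3

    def rows_from_file(path):
        with open(path, newline="") as f:
            for field1, field2, field3 in csv.reader(f):
                yield (field1, field2, field3)

    conn = sqlite3.connect("bulk.db")
    conn.executemany(
        "INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)",
        rows_from_file("data.csv"),
    )
    conn.commit()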

  • 2020-11-30 19:24

    There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.

    This packs multiple value tuples into a single insert command (INSERT INTO x (val1, val2) VALUES (1,2), (3,4), etc.), which should greatly improve performance.

    It also appears to be heavily documented, which is always a plus.
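
    A hedged sketch of the multi-row VALUES idea against Django's raw cursor (this is not the snippet's code; the table and column names are illustrative):

    from django.db import connection

    def bulk_insert(rows):
        # rows: list of (val1, val2) tuples
        placeholders = ", ".join(["(%s, %s)"] * len(rows))
        sql = "INSERT INTO x (val1, val2) VALUES " + placeholders
        params = [v for row in rows for v in row]
        cursor = connection.cursor()
        cursor.execute(sql, params)

    Note that most backends cap the number of bind parameters per statement, so chunk rows accordingly.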

  • 2020-11-30 19:27

    Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
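
    Through Django that might look like the following sketch (myapp_entry is a made-up table name; Django's backends use the %s paramstyle):

    from django.db import connection

    rows = [("a", 1), ("b", 2)]  # in practice, stream these from your files
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO myapp_entry (field1, field2) VALUES (%s, %s)",
        rows,
    )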
