Accelerate bulk insert using Django's ORM?

暖寄归人 · 2020-11-30 18:57

I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using Django's ORM. Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.

7 Answers
  • 2020-11-30 19:11

    Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.

    The other snippet wasn't as simple and was really focused on bulk inserts for relationships; this one is just a plain bulk insert that reuses a single INSERT query. A sketch of the idea follows.
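    A minimal sketch of the custom-manager idea (this is not the snippet's actual code; insert_many and its signature are made up for illustration):

    from django.db import connection, models

    class BulkInsertManager(models.Manager):
        def insert_many(self, rows, fields):
            # rows: iterable of tuples whose order matches `fields`
            table = self.model._meta.db_table
            columns = ", ".join(fields)
            placeholders = ", ".join(["%s"] * len(fields))
            sql = "INSERT INTO %s (%s) VALUES (%s)" % (table, columns, placeholders)
            cursor = connection.cursor()
            cursor.executemany(sql, rows)

    Attach it to a model as objects = BulkInsertManager() and call MyModel.objects.insert_many(rows, ["field1", "field2"]).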

  • 2020-11-30 19:13

    Django 1.4 provides a bulk_create() method on the QuerySet object; see:

    • https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
    • https://docs.djangoproject.com/en/dev/releases/1.4/
    • https://code.djangoproject.com/ticket/7596
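
    A minimal sketch of the call, assuming a hypothetical Entry model with two fields:

    from myapp.models import Entry  # hypothetical app and model

    def load_rows(rows):
        # rows: iterable of (field1, field2) tuples parsed from a file
        Entry.objects.bulk_create(
            [Entry(field1=a, field2=b) for a, b in rows]
        )

    On very large inputs you may need to chunk the list yourself; later Django versions add a batch_size argument for this.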
  • 2020-11-30 19:15

    The development version of Django has bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create

  • 2020-11-30 19:17

    This is not specific to the Django ORM, but I recently had to bulk insert >60 million rows of 8-column data from over 2000 files into a sqlite3 database, and I learned that the following three things cut the insert time from over 48 hours to ~1 hour:

    1. increase your DB's cache size setting to use more RAM (the defaults are always very small; I used 3GB); in sqlite this is done with PRAGMA cache_size = n_of_pages (see the sketch after this list);

    2. do journalling in RAM instead of on disk (this does risk losing the in-flight transaction if the system fails, which I consider negligible given that you have the source data on disk already); in sqlite this is done with PRAGMA journal_mode = MEMORY;

    3. last and perhaps most important: do not build indexes while inserting. This also means not declaring UNIQUE or other constraints that make the DB build an index. Build indexes only after you are done inserting.
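
    A minimal sketch of points 1 and 2 using Python's sqlite3 module (the table and file names are made up; note that a positive cache_size is counted in pages, whose byte size depends on PRAGMA page_size):

    import sqlite3

    conn = sqlite3.connect("bulk.db")
    # 1. Enlarge the page cache (value is a number of pages, not bytes).
    conn.execute("PRAGMA cache_size = 1000000")
    # 2. Keep the rollback journal in RAM instead of on disk.
    conn.execute("PRAGMA journal_mode = MEMORY")
    # 3. No indexes or UNIQUE constraints yet; run CREATE INDEX after the inserts.
    conn.execute("CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)")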

    As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:

    cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
    

    The iterable_data can be a list or any other iterable of parameter tuples, even a generator that parses rows out of an open file, as in the sketch below.
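
    For example, a sketch that streams rows out of a CSV file without loading it all into memory (the three-column layout is assumed):

    import csv
    import sqlite3

    def rows_from_file(path):
        with open(path, newline="") as f:
            for field1, field2, field3 in csv.reader(f):
                yield (field1, field2, field3)

    conn = sqlite3.connect("bulk.db")
    conn.executemany(
        "INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)",
        rows_from_file("data.csv"),
    )
    conn.commit()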

  • 2020-11-30 19:24

    There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.

    This packs multiple value tuples into a single insert command (INSERT INTO x (val1, val2) VALUES (1,2), (3,4), etc.), which should greatly improve performance.

    It also appears to be heavily documented, which is always a plus.
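
    A hedged sketch of the multi-row VALUES idea against Django's raw cursor (this is not the snippet's code; the table and column names are illustrative):

    from django.db import connection

    def bulk_insert(rows):
        # rows: list of (val1, val2) tuples
        placeholders = ", ".join(["(%s, %s)"] * len(rows))
        sql = "INSERT INTO x (val1, val2) VALUES " + placeholders
        params = [v for row in rows for v in row]
        cursor = connection.cursor()
        cursor.execute(sql, params)

    Note that most backends cap the number of bind parameters per statement, so chunk rows accordingly.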

  • 2020-11-30 19:27

    Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
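
    Through Django that might look like the following sketch (myapp_entry is a made-up table name; Django's backends use the %s paramstyle):

    from django.db import connection

    rows = [("a", 1), ("b", 2)]  # in practice, stream these from your files
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO myapp_entry (field1, field2) VALUES (%s, %s)",
        rows,
    )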
