Accelerate bulk insert using Django's ORM?


I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using Django's ORM. Currently each file takes ~20min to process, and I was wondering if there's any way to speed this up.

7 Answers
  • 2020-11-30 19:29

    I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:

    • Insert 3000 rows individually and get IDs from populated objects using Django ORM: 3200ms
    • Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
    • Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) (a runnable sketch follows the code below) and don't get IDs: 574ms
    • Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
    • Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
    from io import StringIO  # in-memory buffer for the CSV data streamed to COPY


    def bulk_to_sql(df, columns, model_cls):
        """ Inserting 3000 takes 774ms avg """
        # to_sql issues the inserts through SQLAlchemy against the model's table
        engine = ExcelImportProcessor._get_sqlalchemy_engine()
        df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)


    def bulk_via_csv(df, columns, model_cls):
        """ Inserting 3000 takes 118ms avg """
        engine = ExcelImportProcessor._get_sqlalchemy_engine()
        # raw_connection() exposes the underlying psycopg2 connection, whose
        # cursor provides copy_from() (Postgres COPY FROM STDIN)
        connection = engine.raw_connection()
        cursor = connection.cursor()
        # serialise the selected columns to a tab-separated in-memory buffer
        output = StringIO()
        df[columns].to_csv(output, sep='\t', header=False, index=False)
        output.seek(0)
        cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
        connection.commit()
        cursor.close()

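
    For completeness, the bulk_create variant from the timings above isn't shown in code. A minimal sketch is below, using the same (df, columns, model_cls) signature; the conversion from DataFrame rows to model instances is my own assumption, not taken verbatim from the test harness:

    def bulk_via_orm(df, columns, model_cls):
        """ Sketch of the bulk_create path timed above (574ms avg) """
        # build unsaved model instances from the frame's rows; assumes the
        # DataFrame column names map 1:1 onto the model's field names
        rows = df[columns].to_dict(orient='records')
        model_cls.objects.bulk_create([model_cls(**row) for row in rows])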

    The performance stats were all obtained on a table that already contained 3,000 rows, running on OS X (i7, SSD, 16GB RAM), averaged over ten runs using timeit.

    I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
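
    For reference, a minimal sketch of that batch-id lookup is below; the import_batch_id field name is a placeholder of mine, not a field from the code above:

    def pks_for_batch(model_cls, batch_id):
        """ Return the primary keys inserted under one import batch, in pk order """
        return list(
            model_cls.objects
            .filter(import_batch_id=batch_id)   # hypothetical batch-id field on the model
            .order_by('pk')
            .values_list('pk', flat=True)
        )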

    Update 2020:

    I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly, performance was worse than the single-row version, whether on Pandas 0.23, 0.24 or 1.1. Pandas single-row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv with cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_import was occasionally comparable, but often slower still (up to double the time, Django 2.2).
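
    For clarity, the to_sql(method="multi") call has roughly this shape; the connection string, DataFrame and table name here are placeholders, not the actual test data:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('postgresql://user:pass@localhost/mydb')  # placeholder DSN
    df = pd.DataFrame({'col_a': [1, 2, 3], 'col_b': ['x', 'y', 'z']})

    # method='multi' packs rows into multi-row INSERT statements (pandas >= 0.24);
    # chunksize caps the number of rows per statement
    df.to_sql('my_table', con=engine, if_exists='append', index=False,
              method='multi', chunksize=1000)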
