Accelerate bulk insert using Django's ORM?


I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using Django's ORM. Currently each file takes ~20min to process, and I was wondering if there's any way to speed this up.

7 Answers
  • 2020-11-30 19:29

    I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:

    • Insert 3000 rows individually and get IDs from populated objects using Django ORM: 3200ms
    • Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
    • Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) (a runnable sketch follows the code below) and don't get IDs: 574ms
    • Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
    • Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
    from io import StringIO  # in-memory buffer for the CSV data streamed to COPY


    def bulk_to_sql(df, columns, model_cls):
        """ Inserting 3000 takes 774ms avg """
        # to_sql issues the inserts through SQLAlchemy against the model's table
        engine = ExcelImportProcessor._get_sqlalchemy_engine()
        df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)


    def bulk_via_csv(df, columns, model_cls):
        """ Inserting 3000 takes 118ms avg """
        engine = ExcelImportProcessor._get_sqlalchemy_engine()
        # raw_connection() exposes the underlying psycopg2 connection, whose
        # cursor provides copy_from() (Postgres COPY FROM STDIN)
        connection = engine.raw_connection()
        cursor = connection.cursor()
        # serialise the selected columns to a tab-separated in-memory buffer
        output = StringIO()
        df[columns].to_csv(output, sep='\t', header=False, index=False)
        output.seek(0)
        cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
        connection.commit()
        cursor.close()

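
    For completeness, the bulk_create variant from the timings above isn't shown in code. A minimal sketch is below, using the same (df, columns, model_cls) signature; the conversion from DataFrame rows to model instances is my own assumption, not taken verbatim from the test harness:

    def bulk_via_orm(df, columns, model_cls):
        """ Sketch of the bulk_create path timed above (574ms avg) """
        # build unsaved model instances from the frame's rows; assumes the
        # DataFrame column names map 1:1 onto the model's field names
        rows = df[columns].to_dict(orient='records')
        model_cls.objects.bulk_create([model_cls(**row) for row in rows])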

    The performance stats were all obtained on a table that already contained 3,000 rows, running on OS X (i7, SSD, 16GB RAM), averaged over ten runs using timeit.

    I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
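
    For reference, a minimal sketch of that batch-id lookup is below; the import_batch_id field name is a placeholder of mine, not a field from the code above:

    def pks_for_batch(model_cls, batch_id):
        """ Return the primary keys inserted under one import batch, in pk order """
        return list(
            model_cls.objects
            .filter(import_batch_id=batch_id)   # hypothetical batch-id field on the model
            .order_by('pk')
            .values_list('pk', flat=True)
        )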

    Update 2020:

    I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly, performance was worse than the single-row version, whether on Pandas 0.23, 0.24 or 1.1. Pandas single-row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv with cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_import was occasionally comparable, but often slower still (up to double the time, Django 2.2).
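
    For clarity, the to_sql(method="multi") call has roughly this shape; the connection string, DataFrame and table name here are placeholders, not the actual test data:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('postgresql://user:pass@localhost/mydb')  # placeholder DSN
    df = pd.DataFrame({'col_a': [1, 2, 3], 'col_b': ['x', 'y', 'z']})

    # method='multi' packs rows into multi-row INSERT statements (pandas >= 0.24);
    # chunksize caps the number of rows per statement
    df.to_sql('my_table', con=engine, if_exists='append', index=False,
              method='multi', chunksize=1000)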
