python pandas to_sql with sqlalchemy : how to speed up exporting to MS SQL?


Question


I have a dataframe with about 155,000 rows and 12 columns. If I export it to csv with dataframe.to_csv, the output is an 11MB file (which is produced instantly).

If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes! No columns are text: only int, float, bool and dates. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but that cannot be the case here.

Any suggestions on how to speed up the export process? Taking 6 minutes to export 11 MB of data makes the ODBC connection practically unusable.

Thanks!

My code is:

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select

ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()

metadata = MetaData(conn)

my_data_frame.to_sql(TableName, engine)

Answer 1:


The DataFrame.to_sql method sends INSERT statements to your ODBC connector, which are then treated by the ODBC connector as regular inserts.

When this is slow, it is not the fault of pandas.

Saving the output of the DataFrame.to_sql method to a file and then replaying that file over an ODBC connector would take the same amount of time.

The proper way of bulk importing data into a database is to generate a CSV file and then use a load command, which in the MS flavour of SQL databases is called BULK INSERT.

For example:

BULK INSERT mydatabase.myschema.mytable
FROM 'mydatadump.csv';

The syntax reference is as follows:

BULK INSERT 
   [ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ] 
      FROM 'data_file' 
     [ WITH 
    ( 
   [ [ , ] BATCHSIZE = batch_size ] 
   [ [ , ] CHECK_CONSTRAINTS ] 
   [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ] 
   [ [ , ] DATAFILETYPE = 
      { 'char' | 'native'| 'widechar' | 'widenative' } ] 
   [ [ , ] FIELDTERMINATOR = 'field_terminator' ] 
   [ [ , ] FIRSTROW = first_row ] 
   [ [ , ] FIRE_TRIGGERS ] 
   [ [ , ] FORMATFILE = 'format_file_path' ] 
   [ [ , ] KEEPIDENTITY ] 
   [ [ , ] KEEPNULLS ] 
   [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ] 
   [ [ , ] LASTROW = last_row ] 
   [ [ , ] MAXERRORS = max_errors ] 
   [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ] 
   [ [ , ] ROWS_PER_BATCH = rows_per_batch ] 
   [ [ , ] ROWTERMINATOR = 'row_terminator' ] 
   [ [ , ] TABLOCK ] 
   [ [ , ] ERRORFILE = 'file_name' ] 
    )] 
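
To drive this pattern from Python, the idea is to dump the DataFrame to a CSV and then issue the BULK INSERT over a normal connection. The sketch below is only an illustration: it assumes a pyodbc connection using "ODBC Driver 17 for SQL Server", that mydatabase.myschema.mytable already exists with matching columns, and that csv_path is readable by the SQL Server machine itself (BULK INSERT reads the file server-side, not on the client); my_data_frame is the DataFrame from the question.

import pyodbc

csv_path = r'C:\temp\mydatadump.csv'   # hypothetical path visible to the SQL Server machine
my_data_frame.to_csv(csv_path, index=False, header=False)

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=myserver;DATABASE=mydatabase;Trusted_Connection=yes'
)
cursor = conn.cursor()
cursor.execute(
    "BULK INSERT mydatabase.myschema.mytable "
    f"FROM '{csv_path}' "
    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')"
)
conn.commit()
conn.close()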



Answer 2:


I recently had the same problem and felt like adding an answer to this for others. to_sql seems to send an INSERT query for every row, which makes it really slow. But since pandas 0.24.0 there is a method parameter in to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.

Note that your database may have a parameter limit. In that case you also have to define a chunksize.

So the solution should simply look like this:

my_data_frame.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi')

If you do not know your database's parameter limit, just try it without the chunksize parameter. It will either run, or give you an error telling you the limit.
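
For reference, SQL Server commonly caps a single statement at 2100 bound parameters, so a safe chunksize can be derived from the column count. A rough sketch, assuming that limit applies to your setup and reusing the names from the question:

params_limit = 2100  # assumed SQL Server per-statement parameter cap; check your database/driver
rows_per_chunk = params_limit // (len(my_data_frame.columns) + 1)  # +1 because to_sql also writes the index by default

my_data_frame.to_sql(TableName, engine, method='multi', chunksize=rows_per_chunk)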




Answer 3:


You can use d6tstack, which has fast pandas-to-SQL functionality because it uses native DB import commands. It supports MS SQL, Postgres and MySQL.

import d6tstack.utils

uri_psql = 'postgresql+psycopg2://usr:pwd@localhost/db'
d6tstack.utils.pd_to_psql(df, uri_psql, 'table')

uri_mssql = 'mssql+pymssql://usr:pwd@localhost/db'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'table', 'schema')  # experimental

It is also useful for importing multiple CSVs with data schema changes and/or preprocessing them with pandas before writing to the db; see further down in the examples notebook.

import glob
import d6tstack.combine_csv

d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'),
    apply_after_read=apply_fun).to_psql_combine(uri_psql, 'table')  # apply_fun: your per-file preprocessing function



Answer 4:


My solution to this problem is below, in case it helps anyone. From what I've read, pandas' to_sql method loads one record at a time.

You can build an INSERT statement that loads 1000 rows at a time and commit that as one transaction, instead of committing a single row each time. This increases the speed massively.

import pandas as pd
from sqlalchemy import create_engine
import pymssql  # driver used by the connection string

connect_string = [your connection string]
engine = create_engine(connect_string, echo=False)
connection = engine.raw_connection()
cursor = connection.cursor()

def load_data(report_name):
    # my report_name variable is also my SQL Server table name,
    # so I use it to build the table name string
    sql_table_name = 'AR_' + str(report_name)
    global chunk  # to QC chunks that fail for some reason
    for chunk in pd.read_csv(report_full_path_new, chunksize=1000):
        # replace single quotes in the data with doubled single quotes to escape them in SQL
        chunk.replace('\'', '\'\'', inplace=True, regex=True)
        chunk.fillna('NULL', inplace=True)

        my_data = str(chunk.to_records(index=False).tolist())  # convert the chunk to a string of tuples
        my_data = my_data[1:-1]  # strip the surrounding list brackets
        # swap double quotes for single quotes and turn quoted 'NULL' strings into SQL NULLs
        my_data = my_data.replace('\"', '\'').replace('\'NULL\'', 'NULL')
        # sql_table_name = [your sql server table name]  # or hard-code the table name here instead

        sql = """
        INSERT INTO {0}
        VALUES {1}
        """.format(sql_table_name, my_data)

        cursor.execute(sql)
        # you must call commit() to persist your data if you don't set autocommit to True
        connection.commit()
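
A hypothetical call would then be load_data('some_report') (the report name here is just an illustration; report_full_path_new must already point at that report's CSV on disk), after which each 1000-row chunk is sent as a single INSERT and committed before the next chunk is read.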



Answer 5:


Based on this answer - Aseem.

You can use the cursor object's copy_from method to simulate a bulk load. This was tested on Postgres (copy_from is a psycopg2 cursor method), so try the equivalent with your DB:

import pandas as pd
from sqlalchemy import create_engine
from io import StringIO  # on Python 2: from StringIO import StringIO

ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)  # don't forget to add a password if needed

my_data_frame.head(0).to_sql(TableName, engine, if_exists='replace', index=False)  # create an empty table - just for the structure
conn = engine.raw_connection()
cur = conn.cursor()
output = StringIO()
my_data_frame.to_csv(output, sep='\t', header=False, index=False)  # an in-memory CSV that will be used for the bulk load
output.seek(0)
cur.copy_from(output, TableName, null="")  # psycopg2-only; empty fields in the CSV are loaded as NULL
conn.commit()
cur.close()
conn.close()


Source: https://stackoverflow.com/questions/29706278/python-pandas-to-sql-with-sqlalchemy-how-to-speed-up-exporting-to-ms-sql
