How to use SQLAlchemy to dump an SQL file from query expressions to bulk-insert into a DBMS?

Marius Gedminas

First, unless you actually have a machine with 50 CPU cores, using 50 threads/processes won't help performance -- it will actually make things slower.

Second, I've a feeling that if you used SQLAlchemy's way of inserting multiple values at once, it would be much faster than creating ORM objects and persisting them one-by-one.
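
As a rough illustration of that multi-row approach, here is a minimal sketch using SQLAlchemy Core; the table definition, column names, and database URL are all made up for the example:

    # A minimal sketch of multi-row insertion with SQLAlchemy Core.
    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

    engine = create_engine('sqlite:///example.db')   # hypothetical database URL
    metadata = MetaData()
    users = Table('users', metadata,
                  Column('id', Integer, primary_key=True),
                  Column('name', String(50)))
    metadata.create_all(engine)

    rows = [{'name': 'alice'}, {'name': 'bob'}, {'name': 'carol'}]

    # Passing a list of dicts makes SQLAlchemy use the DBAPI's executemany(),
    # avoiding the per-object overhead of the ORM unit of work.
    with engine.begin() as conn:
        conn.execute(users.insert(), rows)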

Ordinarily, no, there's no way to get the query with the values included.

What database are you using, though? Because a lot of databases have a bulk-load feature for CSV files available.
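
For instance, PostgreSQL has COPY, MySQL has LOAD DATA INFILE, and SQLite's shell has .import. A hedged sketch of the PostgreSQL route via psycopg2, with a placeholder table, file name, and connection string:

    # Sketch of PostgreSQL's CSV bulk load using psycopg2's copy_expert().
    import psycopg2

    conn = psycopg2.connect('dbname=mydb user=me')   # hypothetical DSN
    with conn, conn.cursor() as cur, open('rows.csv') as f:
        cur.copy_expert("COPY users (name, email) FROM STDIN WITH CSV", f)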

If you're willing to accept that certain values might not be escaped correctly, then you can use this hack I wrote for debugging purposes:

# Replace the parameter placeholders with values (debugging hack only:
# values are not escaped safely, so never use this on untrusted data).
# Sort longer parameter names first so that e.g. :name_1 is replaced
# before :name.
params = sorted(compiler.params.items(),
                key=lambda kv: len(str(kv[0])), reverse=True)
for k, v in params:
    # Some types don't need quoting
    if isinstance(v, (int, float, bool)):
        v = str(v)
    else:
        v = "'%s'" % v

    # Replace the placeholders with values.
    # Works both with :name and %(name)s style placeholders.
    query = query.replace(':%s' % k, v)
    query = query.replace('%%(%s)s' % k, v)
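
One way you might obtain the query string and compiler object the snippet expects (this is an assumption about how the fragment is meant to be wired up; users is a Table defined elsewhere):

    # Compile a statement to get the SQL text and its bound parameters.
    stmt = users.insert().values(name='alice')   # `users` is a hypothetical Table
    compiler = stmt.compile()                    # compiled statement; exposes .params
    query = str(compiler)                        # SQL text with placeholders
    # ...run the replacement loop above, then write `query` to your .sql dump file.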

I would venture to say the time spent in the Python script is in the per-record upload portion. To determine this, you could write to CSV or discard the results instead of uploading new records. This will show where the bottleneck is, at least from a lookup-vs-insert standpoint. If, as I suspect, that is indeed where it is, you can take advantage of the bulk-import feature most DBMSs have. There is no reason to insert record-by-record in this kind of circumstance, and indeed there are some arguments against it.
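
A rough way to separate lookup/processing time from insert time is to run the same loop but dump rows to a CSV file instead of the database; generate_rows() below is a stand-in for whatever produces your records:

    # Dry run: time the processing side only, with no database involved.
    import csv, time

    start = time.time()
    with open('dry_run.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for row in generate_rows():        # hypothetical producer
            writer.writerow(row)
    print('processing only: %.1fs' % (time.time() - start))
    # Compare against the full run; the difference is roughly the insert cost.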

Bulk imports tend to do some interesting optimizations, such as running the whole load as one transaction without committing after each record (even doing just this can produce an appreciable drop in run time); whenever feasible I recommend the bulk insert for large record counts. You could still use the producer/consumer approach, but have the consumer store the values in memory or in a file instead, and then call the bulk-import statement specific to the DB you are using. This might be the route to go if you need to do processing for each record in the CSV file. If so, I would also consider how much of that processing can be cached and shared between records.
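
A sketch of the "one transaction, commit once" idea, reusing the engine and users table from the earlier sketch; rows_iter stands in for your producer:

    # Batch the rows into chunks and execute them inside a single transaction.
    from itertools import islice

    def batches(iterable, size=1000):
        it = iter(iterable)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                break
            yield chunk

    with engine.begin() as conn:                 # one transaction for the whole load
        for chunk in batches(rows_iter, 5000):
            conn.execute(users.insert(), chunk)  # executemany per chunk, commit once at the end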

It is also possible that the bottleneck is SQLAlchemy itself. Not that I know of any inherent issues, but given what you are doing, it might be requiring a lot more processing than necessary, as evidenced by the 8x difference in run times.

For fun, since you already know the SQL, try using a direct DBAPI module in Python to do it and compare run times.
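
A hedged comparison harness using the sqlite3 DBAPI module directly; swap in your real driver (psycopg2, MySQLdb, ...) and SQL as needed, and time the equivalent SQLAlchemy run the same way:

    # Time a plain executemany() through the DBAPI, with no SQLAlchemy involved.
    import sqlite3, time

    rows = [('user%d' % i,) for i in range(100000)]

    start = time.time()
    conn = sqlite3.connect('bench.db')
    conn.execute('CREATE TABLE IF NOT EXISTS users (name TEXT)')
    conn.executemany('INSERT INTO users (name) VALUES (?)', rows)
    conn.commit()
    conn.close()
    print('raw DBAPI: %.2fs' % (time.time() - start))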
