Question
My code looks like this. I use pd.DataFrame.from_records to fill data into the dataframe, but it takes Wall time: 1h 40min 30s to process the request and load the data from a SQL table with 22 million rows into the df.
# I skipped some of the code, since there is no problem with executing the query itself; it is fast
cur = con.cursor()

def db_select(query):  # takes the query text and returns a dataframe
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]  # parse headers
    df = pd.DataFrame.from_records(cur, columns=col)  # fill the data into the dataframe
    return df
Then I pass the SQL query to the function:
frame = db_select("select * from table")
How can I optimize the code to speed up this process?
Answer 1:
Setting a proper value for cur.arraysize might help tune fetch performance.
You need to determine the most suitable value for it; the default is 100. Code like the following can be run with different array sizes in order to determine that value:
from datetime import datetime

arr = [100, 1000, 10000, 100000, 1000000]
for size in arr:
    try:
        cur.prefetchrows = 0
        cur.arraysize = size
        start = datetime.now()
        cur.execute("SELECT * FROM mytable").fetchall()
        elapsed = datetime.now() - start
        print("Process duration for arraysize", size, "is", elapsed)
    except Exception as err:
        print("Memory Error", err, "for arraysize", size)
Then set, for example, cur.arraysize = 10000 before calling db_select in your original code.
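A minimal sketch of what that looks like wired into the original code, assuming an Oracle driver such as cx_Oracle and that 10000 came out best in the benchmark above (con and the table name are taken from the question):

import pandas as pd

cur = con.cursor()
cur.arraysize = 10000  # fetch 10000 rows per round trip instead of the default 100

def db_select(query):  # takes the query text and returns a dataframe
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]  # parse headers
    return pd.DataFrame.from_records(cur, columns=col)

frame = db_select("select * from table")

The setting only changes how many rows each network round trip fetches, so the function body itself stays the same.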
Source: https://stackoverflow.com/questions/65163602/how-to-speed-up-loading-data-from-oracle-sql-to-pandas-df