Question
My code looks like this. I use pd.DataFrame.from_records to fill data into the dataframe, but it takes Wall time: 1h 40min 30s to process the request and load the data from a SQL table with 22 million rows into the df.
# I skipped some of the code, since there is no problem with executing the query itself; it is fast
cur = con.cursor()

def db_select(query):  # takes the query text and returns a dataframe
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]  # parse headers
    df = pd.DataFrame.from_records(cur, columns=col)  # fill the data into the dataframe
    return df
Then I pass the SQL query to the function:
frame = db_select("select * from table")
How can I optimize the code to speed up this process?
Answer 1:
Setting a proper value for cur.arraysize might help tune fetch performance.
You need to determine the most suitable value for it; the default is 100. Code like the following can be run with different array sizes in order to determine that value:
from datetime import datetime

arr = [100, 1000, 10000, 100000, 1000000]
for size in arr:
    try:
        cur.prefetchrows = 0
        cur.arraysize = size
        start = datetime.now()
        cur.execute("SELECT * FROM mytable").fetchall()
        elapsed = datetime.now() - start
        print("Process duration for arraysize", size, "is", elapsed)
    except Exception as err:
        print("Memory Error", err, "for arraysize", size)
Then set, for example, cur.arraysize = 10000 before calling db_select in your original code.
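A minimal sketch of what that looks like wired into the original code, assuming an Oracle driver such as cx_Oracle and that 10000 came out best in the benchmark above (con and the table name are taken from the question):

import pandas as pd

cur = con.cursor()
cur.arraysize = 10000  # fetch 10000 rows per round trip instead of the default 100

def db_select(query):  # takes the query text and returns a dataframe
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]  # parse headers
    return pd.DataFrame.from_records(cur, columns=col)

frame = db_select("select * from table")

The setting only changes how many rows each network round trip fetches, so the function body itself stays the same.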
Source: https://stackoverflow.com/questions/65163602/how-to-speed-up-loading-data-from-oracle-sql-to-pandas-df