Faster way to read Excel files to pandas dataframe

问题

I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!

Does anyone have suggestions for speeding it up?

import pandas as pd

def OTT_read(xl,site_name):
    df = pd.read_excel(xl.io,site_name,skiprows=2,parse_dates=0,index_col=0,
                       usecols=[0,1,2],header=None,
                       names=['date_time','%s_depth'%site_name,'%s_temp'%site_name])
    return df

def make_OTT_df(FILEDIR,OTT_FILE):
    xl = pd.ExcelFile(FILEDIR + OTT_FILE)
    site_names = xl.sheet_names
    df_list = [OTT_read(xl,site_name) for site_name in site_names]
    return site_names,df_list

FILEDIR='c:/downloads/'
OTT_FILE='OTT_Data_All_stations.xlsx'
site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE)

回答1:

As others have suggested, csv reading is faster. So if you are on windows and have Excel, you could call a vbscript to convert the Excel to csv and then read the csv. I tried the script below and it took about 30 seconds.

# create a list with sheet numbers you want to process
sheets = map(str,range(1,6))

# convert each sheet to csv and then read it using read_csv
df={}
from subprocess import call
excel='C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx'
for sheet in sheets:
    csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv' 
    call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet])
    df[sheet]=pd.read_csv(csv)

Here's a little snippet of python to create the ExcelToCsv.vbs script:

#write vbscript to file
vbscript="""if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit
""";

f = open('ExcelToCsv.vbs','w')
f.write(vbscript.encode('utf-8'))
f.close()

This answer benefited from Convert XLS to CSV on command line and csv & xlsx files import to pandas data frame: speed issue

回答2:

If you have less than 65536 rows (in each sheet) you can try xls (instead of xlsx. In my experience xls is faster than xlsx. It is difficult to compare to csv because it depends on the number of sheets.

Although this is not an ideal solution (xls is a binary old privative format), I have found it is useful if you have too many sheets, internal formulas with values that are often updated, or for whatever reason you would really like to keep the excel multisheet functionality.

回答3:

I know this is old but in case anyone else is looking for an answer that doesn't involve VB. Pandas read_csv() is faster but you don't need a VB script to get a csv file.

Open your Excel file and save as *.csv (comma separated value) format.

Under tools you can select Web Options and under the Encoding tab you can change the encoding to whatever works for your data. I ended up using Windows, Western European because Windows UTF encoding is "special" but there's lots of ways to accomplish the same thing. Then use the encoding argument in pd.read_csv() to specify your encoding.

Encoding options are listed here

回答4:

There's no reason to open excel if you're willing to deal with slow conversion once.

Read the data into a dataframe with pd.read_excel()
Dump it into a csv right away with pd.to_csv()

Avoid both excel and windows specific calls. In my case the one-time time hit was worth the hassle. I got a ☕.

回答5:

In my experience, Pandas read_excel() works fine with Excel files with multiple sheets. As suggested in Using Pandas to read multiple worksheets, if you assign sheet_name to None it will automatically put every sheet in a Dataframe and it will output a dictionary of Dataframes with the keys of sheet names.

But the reason that it takes time is for where you parse texts in your code. 14MB excel with 5 sheets is not that much. I have a 20.1MB excel file with 46 sheets each one with more than 6000 rows and 17 columns and using read_excel it took like below:

t0 = time.time()

def parse(datestr):
    y,m,d = datestr.split("/")
    return dt.date(int(y),int(m),int(d))

data = pd.read_excel("DATA (1).xlsx", sheet_name=None, encoding="utf-8", skiprows=1, header=0, parse_dates=[1], date_parser=parse)

t1 = time.time()

print(t1 - t0)
## result: 37.54169297218323 seconds

In code above data is a dictionary of 46 Dataframes.

As others suggested, using read_csv() can help because reading .csv file is faster. But consider that for the fact that .xlsx files use compression, .csv files might be larger and hence, slower to read. But if you wanted to convert your file to comma-separated using python (VBcode is offered by Rich Signel), you can use: Convert xlsx to csv

来源：https://stackoverflow.com/questions/28766133/faster-way-to-read-excel-files-to-pandas-dataframe

标签

python

pandas

import-from-excel