How to speed up import of large xlsx files?


Question


I want to process a large 200 MB Excel (xlsx) file (15 sheets, each with 1 million rows and 5 columns) and create a pandas dataframe from the data. Importing the Excel file is extremely slow (up to 10 minutes). Unfortunately, the Excel file format is mandatory for the input (I know that CSV would be faster...).

How can I speed up the import of a large Excel file into a pandas dataframe? It would be great to get the time down to around 1-2 minutes, if possible, which would be much more bearable.

What I have tried so far:

Option 1 - Pandas I/O read_excel

%%timeit -r 1
import pandas as pd
import datetime

xlsx_file = pd.ExcelFile("Data.xlsx")
list_sheets = []

for sheet in xlsx_file.sheet_names:
    list_sheets.append(xlsx_file.parse(sheet, header = 0, dtype={
        "Sales": float,
        "Client": str, 
        "Location": str, 
        "Country": str, 
        "Date": datetime.datetime
        }).fillna(0))

output_dataframe = pd.concat(list_sheets)

10min 44s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Option 2 - Dask

%%timeit -r 1
import pandas as pd
import dask
import dask.dataframe as dd
from dask.delayed import delayed

excel_file = "Data.xlsx"

parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0)
output_dataframe = dd.from_delayed(parts)

10min 12s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Option 3 - openpyxl and csv

Just creating the separate csv files from the Excel workbook took around 10 minutes, before even importing the csv files into a pandas dataframe via read_csv (a sketch of that read_csv step follows the timing below).

%%timeit -r 1
import openpyxl
import csv

from openpyxl import load_workbook
wb = load_workbook(filename = "Data.xlsx", read_only=True)

list_ws = wb.sheetnames
nws = len(wb.sheetnames) #number of worksheets in workbook

# create a separate csv file from each worksheet (15 in total)
for i in range(0, nws):
    ws = wb[list_ws[i]]
    with open("output/%s.csv" %(list_ws[i].replace(" ","")), "w", newline="") as f:
        c = csv.writer(f)
        for r in ws.rows:
            c.writerow([cell.value for cell in r])

9min 31s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
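
For reference, the follow-up read_csv step mentioned above could look roughly like the sketch below. It assumes the 15 CSV files were written to the output/ directory by the code above and share the same columns; the glob pattern and the ignore_index choice are illustrative, not part of the original code.

import glob
import pandas as pd

# collect the CSV files produced by the openpyxl/csv conversion step
csv_files = glob.glob("output/*.csv")

# read_csv is typically much faster than read_excel for the same data
frames = [pd.read_csv(path) for path in csv_files]
output_dataframe = pd.concat(frames, ignore_index=True)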

I use Python 3.7.3 (64-bit) on a single machine (Windows 10) with 16 GB RAM and 8 cores (i7-8650U CPU @ 1.90GHz). I run the code in my IDE (Visual Studio Code).


Answer 1:


The compression isn't the bottleneck; the problem is parsing the XML and creating new data structures in Python. Judging from the speeds you're quoting, I'm assuming these are very large files: see the note on performance in the documentation for more details. Both xlrd and openpyxl are running close to the limits of the underlying Python and C libraries.

Starting with openpyxl 2.6 you do have the values_only option when reading cells, which will speed things up a bit. You can also use multiple processes with read-only mode to read worksheets in parallel, which should speed things up if you have multiple processors.
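
A minimal sketch of what that could look like, assuming the file name Data.xlsx and a header row with the column names from the question ("Sales", "Client", "Location", "Country", "Date"), and using multiprocessing.Pool to read one worksheet per process; the function and variable names here are illustrative, not from the original answer:

from multiprocessing import Pool

import pandas as pd
from openpyxl import load_workbook

COLUMNS = ["Sales", "Client", "Location", "Country", "Date"]  # assumed header row

def read_sheet(sheet_name):
    # each process opens its own handle; read-only mode streams rows lazily
    wb = load_workbook("Data.xlsx", read_only=True)
    ws = wb[sheet_name]
    # values_only=True returns plain cell values instead of Cell objects
    data = [row for row in ws.iter_rows(min_row=2, values_only=True)]
    wb.close()
    return pd.DataFrame(data, columns=COLUMNS)

if __name__ == "__main__":
    sheet_names = load_workbook("Data.xlsx", read_only=True).sheetnames
    with Pool() as pool:  # one worker per CPU core by default
        frames = pool.map(read_sheet, sheet_names)
    output_dataframe = pd.concat(frames, ignore_index=True)

How much this helps depends on how much of the time is spent in the XML parsing itself versus building the dataframes, but on an 8-core machine reading 15 sheets in parallel should cut the wall-clock time noticeably.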



Source: https://stackoverflow.com/questions/55778303/how-to-speed-up-import-of-large-xlsx-files
