Python: make reading an Excel file faster

Submitted by 社会主义新天地 on 2019-12-11 03:56:22

Question


I made a script that reads an Excel document and checks whether the first cell of each row contains "UPDATED". If so, it writes the whole row to another Excel document, to a sheet with the same tab name.

My Excel document has 23 sheets with 1,000 rows on each sheet, and it currently takes more than 15 minutes to complete. Is there a way to speed this up?

I was thinking about multithreading or multiprocessing, but I don't know which one is better.

UPDATE: the 15-minute runtime was caused entirely by read-only mode; once I removed it, the program took only 2 seconds to run.

import openpyxl
import os
from datetime import datetime

titles = ["Column1", "Column2", "Column3", "Column4", "Column5","Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16"]


def main():
    oldFilePath = os.path.join(os.getcwd(), "oldFile.xlsx")  # os.path.join avoids backslash escapes ("\n" in "\newFile.xlsx" is a newline)
    newFilePath = os.path.join(os.getcwd(), "newFile.xlsx")

    wb = openpyxl.load_workbook(filename=oldFilePath, read_only=True)
    wb2 = openpyxl.Workbook()

    sheets = wb.sheetnames
    sheets2 = wb2.sheetnames

    #removes all sheets in newFile.xlsx
    for sheet in sheets2:
        wb2.remove(wb2[sheet])

    for tab in sheets:
        print("Sheet: " + str(tab))
        rowCounter = 2

        sheet = wb[tab]
        for row in range(sheet.max_row):
            if sheet.cell(row=row + 1, column=1).value in ("", None): #stop at the first empty cell (openpyxl returns None for empty cells)
                break
            elif sheet.cell(row=row + 1, column=1).value == "UPDATED":
                if tab not in sheets2:
                    sheet2 = wb2.create_sheet(title=tab)
                    sheet2.append(titles)

                for x in range(1, 17):
                    sheet2.cell(row=rowCounter, column=x).value = sheet.cell(row=row + 1, column=x).value

                rowCounter += 1

                sheets2 = wb2.sheetnames

    wb2.save(filename=newFilePath)


if __name__ == "__main__":
    startTime = datetime.now()
    main()
    print("Script finished in: " + str(datetime.now() - startTime))

Answer 1:


For such small workbooks there is no need to use read-only mode, and by using it injudiciously you are causing the problem yourself: every call to ws.cell() forces openpyxl to parse the worksheet again.

So either stop using read-only mode, or use ws.iter_rows() as I advised on your previous question.
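
For reference, here is a minimal sketch of the iter_rows() approach in read-only mode. The file names, the "UPDATED" marker, the 16-column limit and the column headers are taken from the question's script; everything else is illustrative:

import os
import openpyxl

oldFilePath = os.path.join(os.getcwd(), "oldFile.xlsx")
newFilePath = os.path.join(os.getcwd(), "newFile.xlsx")
titles = ["Column" + str(i) for i in range(1, 17)]

wb = openpyxl.load_workbook(filename=oldFilePath, read_only=True)
wb2 = openpyxl.Workbook()
wb2.remove(wb2.active)  # drop the default empty sheet

for tab in wb.sheetnames:
    sheet = wb[tab]
    sheet2 = None
    # iter_rows() streams each row exactly once instead of re-parsing
    # the worksheet for every single cell() call
    for row in sheet.iter_rows(min_col=1, max_col=16):
        values = [cell.value for cell in row]
        if values[0] in ("", None):  # stop at the first empty row
            break
        if values[0] == "UPDATED":
            if sheet2 is None:
                sheet2 = wb2.create_sheet(title=tab)
                sheet2.append(titles)
            sheet2.append(values)

wb2.save(filename=newFilePath)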

In general, if you think something is running slowly, you should profile it rather than just trying something out and hoping for the best.
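
One quick way to do that is the standard-library cProfile module; a minimal example, assuming the question's main() is in scope:

import cProfile

# run main() under the profiler and print the slowest calls by cumulative time
cProfile.run("main()", sort="cumulative")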




Answer 2:


You should have a look at some great multiprocessing tutorials, e.g.: https://www.blog.pythonlibrary.org/2016/08/02/python-201-a-multiprocessing-tutorial/

Also, the Python documentation will give you some great examples: https://docs.python.org/3.6/library/multiprocessing.html

You should give special attention to topics like using Pools and Queues.

Multiprocessing will help you get around the limitations of the Global Interpreter Lock, so that might be a good way to improve your performance.
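
As a rough sketch only, a multiprocessing.Pool could split the work by sheet. This assumes each worker re-opens oldFile.xlsx itself (openpyxl workbook objects cannot be shared between processes) and returns the matching rows of one sheet, with the parent process writing newFile.xlsx on its own:

import os
import openpyxl
from multiprocessing import Pool

OLD_PATH = os.path.join(os.getcwd(), "oldFile.xlsx")
NEW_PATH = os.path.join(os.getcwd(), "newFile.xlsx")
TITLES = ["Column" + str(i) for i in range(1, 17)]

def collect_updated_rows(tab):
    # each worker opens its own read-only copy of the source workbook
    wb = openpyxl.load_workbook(OLD_PATH, read_only=True)
    rows = []
    for row in wb[tab].iter_rows(min_col=1, max_col=16):
        values = [cell.value for cell in row]
        if values[0] in ("", None):  # stop at the first empty row
            break
        if values[0] == "UPDATED":
            rows.append(values)
    wb.close()
    return tab, rows

if __name__ == "__main__":
    tabs = openpyxl.load_workbook(OLD_PATH, read_only=True).sheetnames
    with Pool() as pool:
        results = pool.map(collect_updated_rows, tabs)

    wb2 = openpyxl.Workbook()
    wb2.remove(wb2.active)  # drop the default empty sheet
    for tab, rows in results:
        if rows:
            sheet2 = wb2.create_sheet(title=tab)
            sheet2.append(TITLES)
            for values in rows:
                sheet2.append(values)
    wb2.save(NEW_PATH)

With only 23 sheets, the per-process overhead of re-opening the workbook may well cancel out any gain, so it is worth timing this against the single-process version before committing to it.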

Tuning the performance of I/O processes can be a tricky topic, so you'll need to find out more details about the bottleneck. If you can't improve its performance, you might try to find an alternative way of getting the same data.



Source: https://stackoverflow.com/questions/49400498/python-make-reading-a-excel-file-faster
