Openpyxl: optimizing cell search speed

小蘑菇 2020-12-04 03:09

I need to search an Excel sheet for cells containing some pattern. It takes more time than I can handle. The most optimized code I could write is below. Since the data patt…

1 Answer
  • 2020-12-04 03:47

    Looping over a worksheet multiple times is inefficient, and the reason the search gets progressively slower is that each loop uses more and more memory. Because last_row = FindXlCell("Cell[0,0]", last_row) means the next search creates new cells at the end of the rows: openpyxl creates cells on demand, since rows can be technically empty while the cells in them are still addressable. By the end of your script the worksheet has 598000 rows in total, yet you always start searching from A1.
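The cells-on-demand behaviour is easy to demonstrate: merely reading a distant cell makes openpyxl materialise it, which grows the sheet. A small sketch (a fresh in-memory workbook, no file involved):

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active

print(ws.max_row)  # a fresh sheet reports 1 row

# Just *reading* a far-away cell creates it (and every bookkeeping
# structure up to it), so the sheet now spans 100 rows.
_ = ws["A100"]
print(ws.max_row)
```

This is the same mechanism that makes the repeated searches in the question slower each time: every pass that touches rows beyond the data appends empty-but-addressable cells.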

    If you wish to search a large file for text multiple times, it would probably make sense to build a lookup dictionary once, keyed by the cell text, with the coordinates as the values.

    Something like:

    matrix = {}
    for row in ws.iter_rows():
        for cell in row:
            matrix[cell.value] = (cell.row, cell.col_idx)
    

    In a real-world example you'd probably want to use a defaultdict to be able to handle multiple cells with the same text.
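A minimal sketch of that defaultdict variant; a plain list of lists stands in for the worksheet rows here, and the sample values are made up:

```python
from collections import defaultdict

# Simulated worksheet rows (stand-in for openpyxl's ws.iter_rows()).
rows = [
    ["total", "Q1", "total"],
    ["Q2", "total", "Q3"],
]

# Map each text value to the list of ALL coordinates where it appears,
# so duplicate cell texts no longer overwrite each other.
index = defaultdict(list)
for r, row in enumerate(rows, start=1):
    for c, value in enumerate(row, start=1):
        index[value].append((r, c))

print(index["total"])  # [(1, 1), (1, 3), (2, 2)]
```

After this single pass, every subsequent "search" is a dictionary lookup instead of another walk over the sheet.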

    This could be combined with read-only mode for a minimal memory footprint, unless, of course, you want to edit the file.
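A sketch combining both ideas; the file name sample.xlsx and its contents are invented for illustration, and the script creates the file itself so it can run end-to-end:

```python
from collections import defaultdict

from openpyxl import Workbook, load_workbook

# Build a tiny sample file first (made-up data) so the sketch is runnable.
wb = Workbook()
wb.active.append(["apple", "pear"])
wb.active.append(["pear", "apple"])
wb.save("sample.xlsx")

# read_only=True streams rows instead of materialising every cell,
# so memory stays flat even for very large sheets.
index = defaultdict(list)
ro = load_workbook("sample.xlsx", read_only=True)
for row in ro.active.iter_rows():
    for cell in row:
        index[cell.value].append((cell.row, cell.column))
ro.close()  # read-only workbooks keep the file handle open

print(index["apple"])  # coordinates of every "apple" cell
```

The trade-off stated above applies: a read-only workbook cannot be modified and saved, so this only fits the search-many-times use case.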
