I need to search the Excel sheet for cells containing some pattern. It takes more time than I can handle. The most optimized code I could write is below.
Looping over a worksheet multiple times is inefficient, and the search gets progressively slower because each pass uses more and more memory. The culprit is

    last_row = FindXlCell("Cell[0,0]", last_row)

which makes every subsequent search create new cells at the end of the rows: openpyxl creates cells on demand, because rows can be technically empty while the cells in them are still addressable. By the end of your script the worksheet holds a total of 598000 rows, yet you always start searching from A1.
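The on-demand behaviour is easy to demonstrate (a minimal sketch using nothing but openpyxl; the row number 1000 is arbitrary):

    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    print(ws.max_row)            # 1: a fresh sheet reports a single row
    ws.cell(row=1000, column=1)  # merely *reading* this cell creates it
    print(ws.max_row)            # 1000: the sheet now spans 1000 rows

Every search that runs past the previous end of the data therefore leaves the worksheet bigger, and the next scan from A1 has even more cells to walk.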
If you wish to search a large file for text multiple times, it would probably make sense to create a matrix keyed by the text, with the coordinates as the values. Something like:
    from openpyxl import load_workbook

    wb = load_workbook("file.xlsx")  # path is a placeholder
    ws = wb.active

    matrix = {}
    for row in ws:
        for cell in row:
            matrix[cell.value] = (cell.row, cell.col_idx)
In a real-world example you'd probably want to use a defaultdict to be able to handle multiple cells containing the same text, as sketched below.
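A minimal sketch of that variant, reusing the ws worksheet from above; the search string "Cell[0,0]" is the one from the question:

    from collections import defaultdict

    matrix = defaultdict(list)
    for row in ws:
        for cell in row:
            if cell.value is not None:  # skip empty cells
                matrix[cell.value].append((cell.row, cell.col_idx))

    # all coordinates whose cell contains "Cell[0,0]"
    print(matrix["Cell[0,0]"])

After the single pass that builds the dictionary, every lookup is a constant-time operation instead of another full scan of the sheet.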
This could be combined with read-only mode for a minimal memory footprint, unless, of course, you want to edit the file.
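A sketch of building the same lookup in read-only mode (the filename is a placeholder; in read-only mode cells expose a numeric column attribute rather than col_idx):

    from collections import defaultdict
    from openpyxl import load_workbook

    wb = load_workbook("big_file.xlsx", read_only=True)  # placeholder path
    ws = wb.active

    matrix = defaultdict(list)
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is not None:
                matrix[cell.value].append((cell.row, cell.column))

    wb.close()  # read-only workbooks keep the file handle open until closed

Read-only mode streams rows instead of building the whole cell tree in memory, which is why the footprint stays small, but the workbook cannot be modified or saved.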