Fastest Way To Run Through 50k Lines of Excel File in OpenPYXL

aghast

I think what you're trying to do is get a key out of column B of each row, and use that key as the name of the file to append to. Let's speed it up a lot:

from collections import defaultdict
Value_entries = defaultdict(list)  # maps each key to a list of row values

# rowRange is whatever range you were already iterating over
for row in ws.iter_rows(rowRange):
    key = row[1].value  # column B holds the key used as the filename

    Value_entries[key].extend(cell.value for cell in row)

# All done. Now write each file exactly once:
for key, values in Value_entries.items():
    with open(str(key) + '.txt', 'w') as f:
        f.write(','.join(str(v) for v in values))
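For context, here is a minimal sketch of how that snippet might be driven; the workbook filename and the range string below are placeholders, not values from the original question, and loading the workbook with read_only=True is usually the single biggest speed win when streaming through a 50k-row sheet. (Older openpyxl versions accept a range string as the first argument to iter_rows(); newer ones want min_row/max_row/min_col/max_col keywords instead.)

from openpyxl import load_workbook

# read_only=True streams rows instead of building the full in-memory model,
# which is much faster and lighter for large files
wb = load_workbook('big_file.xlsx', read_only=True)  # placeholder filename
ws = wb.active
rowRange = 'A1:B50000'  # placeholder range string; use whatever your code already has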

It looks like you only want cells from column B. In this case you can use ws.get_squared_range() to restrict the number of cells to look at.

import os

for row in ws.get_squared_range(min_col=2, max_col=2, min_row=1, max_row=ws.max_row):
    for cell in row:  # each row is always a sequence, even with a single column
        filename = cell.value
        if os.path.isfile(filename):
            …
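If you are on a newer openpyxl release, ws.get_squared_range() has since been removed; here is a roughly equivalent sketch using iter_rows() keyword arguments (this is an assumption about your openpyxl version, not part of the original answer, and values_only requires openpyxl 2.6+):

import os

# min_col/max_col restrict the iteration to column B; values_only=True yields
# plain values instead of Cell objects, which is faster for large sheets
for (filename,) in ws.iter_rows(min_col=2, max_col=2, min_row=1,
                                max_row=ws.max_row, values_only=True):
    if filename and os.path.isfile(str(filename)):
        ...  # same handling as above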

It's not clear what's happening with the else branch of your code, but you should probably be closing any files you open as soon as you have finished with them.
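As a minimal sketch of that point (the name and value here are placeholders for illustration), using open() as a context manager guarantees the file is closed as soon as the block ends, even if an exception is raised:

filename = 'example'   # placeholder name for illustration
value = 'some_value'   # placeholder value for illustration

# 'a' appends if the file already exists and creates it otherwise;
# the with-block closes the handle as soon as the indented code finishes
with open(filename + '.txt', 'a') as f:
    f.write(value + ',')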

Based on the other question you linked to, and the code above, it appears you have a spreadsheet of name/value pairs. The name is in column A and the value is in column B. A name can appear multiple times in column A, and there can be a different value in column B each time. The goal is to create a list of all the values that show up for each name.

First, a few observations on the code above:

  1. counter is never initialized. Presumably it is initialized to 1.

  2. open(textfilename,...) is called twice without closing the file in between. Calling open allocates some memory to hold data related to operating on the file. The memory allocated for the first open call may not get freed until much later, maybe not until the program ends. It is better practice to close files when you are done with them (see using open as a context manager).

  3. The looping logic isn't correct. Consider:

First iteration of inner loop:

for cell in row:                        # cell refers to A1
    valueLocation = "B" + str(counter)  # valueLocation is "B1"
    value = ws[valueLocation].value     # value gets contents of cell B1
    name = cell.value                   # name gets contents of cell A1
    textfilename = name + ".txt"
    ...
    # opens a file named after the contents of cell A1, and
    # writes the value from cell B1 to that file
    ...
    counter = counter + 1               # counter = 2

But each row has at least two cells, so on the second iteration of the inner loop:

for cell in row:                          # cell now refers to cell B1
    valueLocation = "B" + str(counter)    # valueLocation is "B2"
    value = ws[valueLocation].value       # value gets contents of cell B2
    name = cell.value                     # name gets contents of cell B1
    textfilename = name + ".txt"
    ...
    # opens a file named after the contents of cell B1    <<<< wrong file
    # writes the value of cell B2 to that file            <<<< wrong value
    ...
    counter = counter + 1                 # counter = 3 when cell B1 is processed

Repeat this for each of the 50K rows. Depending on how many unique values there are in columns A and B, the program could end up trying to hold hundreds or thousands of files open at once (named after the contents of cells A1, B1, A2, B2, ...), which makes it very slow or crashes it outright.

  4. iter_rows() returns each row as a tuple of its cells, so you can index into it directly (row[0] is column A, row[1] is column B).

  5. As people suggested in the other question, use a dictionary and lists to store the values, and write them all out at the end (I'm using Python 3.5, so you may have to adjust this if you are using 2.7).

Here is a straightforward solution:

from collections import defaultdict

data = defaultdict(list)

# gather the values into lists associated with each name
# data will look like { 'name1':['value1', 'value42', ...],
#                       'name2':['value7', 'value23', ...],
#                       ...}
for row in ws.iter_rows():
    name = row[0].value
    value = row[1].value
    data[name].append(value)

for key, valuelist in data.items():
    # turn the list of values into one comma-separated string
    # e.g., ['value1', 'value42', ...] => 'value1,value42,...'
    value = ",".join(str(v) for v in valuelist)

    with open(str(key) + ".txt", "w") as f:
        f.write(value)
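If your openpyxl is version 2.6 or newer (an assumption about your setup, not something from the original answer), passing values_only=True to iter_rows() skips building Cell objects entirely, which shaves more time off a 50k-row pass; this sketch is a drop-in replacement for the gathering loop above:

# values_only=True yields plain (name, value) tuples instead of Cell objects
for name, value in ws.iter_rows(min_col=1, max_col=2, values_only=True):
    data[name].append(value)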