Question
I have two sheets with some data (18k rows each) and need to check whether each value from source.xlsx exists in target.xlsx. The rows in the source file should be unique. If a cell from the source file exists in a specific column of the target file, then the next column in the target file needs to be filled with the value from the corresponding column of the source file. It is quite tricky, so here is an example:
target.xlsx
<table><tbody><tr><th>Data</th><th>price</th><th> </th></tr><tr><td>1234grt </td><td> </td><td> </td></tr><tr><td>7686tyug </td><td> </td><td> </td></tr><tr><td>9797tyu </td><td> </td><td> </td></tr><tr><td>9866yyy </td><td> </td><td> </td></tr><tr><td>98845r </td><td> </td><td> </td></tr><tr><td>4567yut </td><td> </td><td> </td></tr><tr><td>1234grt</td><td> </td><td> </td></tr><tr><td>98845r </td><td> </td><td> </td></tr></tbody></table>
source.xlsx
<table><tbody><tr><th>Data</th><th>price</th><th> </th></tr><tr><td>98845r </td><td>$50</td><td> </td></tr><tr><td>7686tyug </td><td>$67</td><td> </td></tr><tr><td>9797tyu </td><td>$56</td><td> </td></tr><tr><td>4567yut </td><td>$67</td><td> </td></tr><tr><td>9866yyy </td><td>$76</td><td> </td></tr><tr><td>98845r </td><td>$56</td><td> </td></tr><tr><td>1234grt</td><td>$34</td><td> </td></tr></tbody></table>
for i in range(1, source_sheet_max_rows, 1):
    print(i)
    if source_wb[temp_sheet_name].cell(row=i, column=1).value in target_values:
        for j in range(1, target_sheet_max_rows, 1):
            if target_wb[temp_sheet_name].cell(row=j, column=1).value == source_wb[temp_sheet_name].cell(row=i, column=1).value:
                target_wb[temp_sheet_name].cell(row=j, column=2).value = source_wb[temp_sheet_name].cell(row=i, column=2).value
target_wb.save(str(temp_sheet_name))
Here, target_values contains the values from column 1 of the target sheet.
The above code works, but it is very slow, and I think there is a better way to do it. The files contain more than 18k rows, so comparing the data this way would take ages. The tricky part is that I need to know which row in the target file holds the cell from the source file, so that I can fill the column with the corresponding value. I am using openpyxl, but I could use pandas if that is easier.
Thx
Answer 1:
Question: Check whether a value from source.xlsx exists in the target.xlsx file.
Documentation: openpyxl - Accessing many cells; Python - Mapping Types (dict); Python - object.__init__
Implement it like the following example:
from openpyxl import load_workbook


class SourceSheet:
    def __init__(self, ws):
        self.ws = ws

    def __iter__(self):
        """
        Implement iterRows or iterRange
        :return: yield a tuple (value_to_search, value_to_fill)
        """
        # Example iterRange; the worksheet attribute is max_row (not max_rows)
        for row in range(1, self.ws.max_row + 1):
            yield (self.ws.cell(row=row, column=1).value, self.ws.cell(row=row, column=2).value)


class TargetSheet:
    def __init__(self, ws):
        self.ws = ws
        # Create a dict from all values in column A.
        # This allows random access by cell value to get the cell row index:
        #   dict key   == cell value
        #   dict value == cell row index
        # If a value occurs more than once in column A, only its last row is kept.
        self._columnA = {c.value: c.row for c in ws['A']}

    def find(self, value):
        """
        Implement either linear search using iterRows over one column or
        search in dict to find 'value'
        :param value: The value to find
        :return: The Cell, to write the 'value_to_fill', or None if not found
        """
        # Example using dict
        if value in self._columnA:
            return self.ws.cell(row=self._columnA[value], column=2)
        return None


# Assumption: ws1 and ws2 are the first sheets of the files named in the question.
source_wb = load_workbook('source.xlsx')
target_wb = load_workbook('target.xlsx')
ws1 = source_wb.worksheets[0]
ws2 = target_wb.worksheets[0]

sourceSheet = SourceSheet(ws1)
targetSheet = TargetSheet(ws2)
for value_to_search, value_to_fill in sourceSheet:
    print("SourceSheet:{}".format((value_to_search, value_to_fill)))
    targetCell = targetSheet.find(value_to_search)
    if targetCell is not None:
        print("Match: Write value '{}' to TargetSheet:'{}'".format(value_to_fill, targetCell))
        targetCell.value = value_to_fill
    else:
        print("Value '{}' not found in TargetSheet!".format(value_to_search))

# Save once, after all cells have been updated.
target_wb.save('target.xlsx')
Output:
SourceSheet:('cell.A1.value', 'cell.B1.value')
Match: Write value 'cell.B1.value' to TargetSheet:'Cell.B1:'
SourceSheet:('cell.A2.value', 'cell.B2.value')
Match: Write value 'cell.B2.value' to TargetSheet:'Cell.B2:'
SourceSheet:('cell.A3.value', 'cell.B3.value')
Match: Write value 'cell.B3.value' to TargetSheet:'Cell.B3:'
Tested with Python: 3.5
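One caveat with a plain value-to-row dict: the target table in the question repeats some values (1234grt and 98845r each appear twice), and a dict keeps only one row per key. Below is a minimal sketch of one way around this, using only the standard library: map each value to a list of row numbers. The helpers build_row_index and plan_updates are hypothetical names, and the sample lists merely stand in for worksheet columns.

```python
from collections import defaultdict


def build_row_index(column_values, first_row=1):
    """Map each cell value to the list of row numbers where it occurs."""
    index = defaultdict(list)
    for row, value in enumerate(column_values, start=first_row):
        if value is not None:  # skip empty cells
            index[value].append(row)
    return index


def plan_updates(source_pairs, row_index):
    """Return {target_row: price} for every source value found in the index."""
    updates = {}
    for value, price in source_pairs:
        # .get() avoids creating empty entries for values missing from the target
        for row in row_index.get(value, []):
            updates[row] = price
    return updates


# Sample data shaped like the tables in the question (header row omitted).
target_column = ["1234grt", "7686tyug", "9797tyu", "9866yyy",
                 "98845r", "4567yut", "1234grt", "98845r"]
source_pairs = [("98845r", "$50"), ("7686tyug", "$67"), ("1234grt", "$34")]

row_index = build_row_index(target_column, first_row=2)  # data starts at row 2
updates = plan_updates(source_pairs, row_index)
for row in sorted(updates):
    print(row, updates[row])
```

With openpyxl, the column values could come from something like [c.value for c in ws['A']], and each planned update could then be applied with ws.cell(row=row, column=2).value = price before saving the workbook once.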
Answer 2:
From my understanding of your question, it seems that the rows in the target file are not arranged in the same order as in the source file.
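source_ws = source_wb[temp_sheet_name]
target_ws = target_wb[temp_sheet_name]
# range() excludes its end value; + 1 assumes the *_max_rows variables hold ws.max_row
for i in range(1, source_sheet_max_rows + 1):
    for j in range(1, target_sheet_max_rows + 1):
        if target_ws.cell(row=j, column=1).value == source_ws.cell(row=i, column=1).value:
            # Assign with '='; '==' would only compare the two values
            target_ws.cell(row=j, column=2).value = source_ws.cell(row=i, column=2).value
            break  # stops at the first match; remove if values repeat in the target
# Save once, after the loops finish
target_wb.save(temp_sheet_name)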
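The nested loops above compare every source row with every target row, which for 18k rows on each side means hundreds of millions of cell reads. As a rough sketch of an alternative, the source can be read once into a lookup dict and the target walked in a single pass. The function fill_prices is a hypothetical helper, and the rows are plain (data, price) tuples rather than real worksheet cells.

```python
def fill_prices(target_rows, source_rows):
    """Return the target rows with the price filled in from a source lookup.

    Both arguments are sequences of (data, price) tuples.
    """
    # One pass over the source builds the lookup; if a key repeats,
    # the later row overwrites the earlier one.
    price_by_value = {data: price for data, price in source_rows}

    filled = []
    for data, old_price in target_rows:
        # dict.get keeps the existing price when the value is not in the source
        filled.append((data, price_by_value.get(data, old_price)))
    return filled


source_rows = [("98845r", "$50"), ("7686tyug", "$67"), ("98845r", "$56")]
target_rows = [("1234grt", None), ("98845r", None),
               ("7686tyug", None), ("98845r", None)]

result = fill_prices(target_rows, source_rows)
for row in result:
    print(row)
```

Because the lookup is keyed on the source, repeated values in the target are all filled. With openpyxl 2.6 or later, rows in this shape could be read with ws.iter_rows(min_col=1, max_col=2, values_only=True); writing the results back to column 2 and saving once at the end is left out here.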
Source: https://stackoverflow.com/questions/53725514/openpyxl-compare-cells