Compare 2 Excel files and output an Excel file with differences

萝らか妹 submitted on 2019-12-12 06:43:13

Question


Assume for simplicity that the data files look like this, sorted on ID:

ID  | Data1  | Data2  | Data3   | Data4
199 |  Tim   |   55   |  work  |  $55
345 |  Joe   |   45   |  work  |  $34
356 |  Sam   |   23   |  uni   |  $12

Each file has more than 100,000 rows and about 50 columns. I want to compare the second file with the first to find new records (an ID that appears only in the second file), edits (the IDs match but column 2 or 4, i.e. Data1 or Data3, has changed), and deletes (an ID in the first file that does not exist in the second file).

Output is to appear in an Excel file with the first column containing D, E or N (for Delete, Edit and New), and the rest of the columns being the same as the columns in the files being compared.

For new records, the full new record is to appear in the output file. For edits, both the old and the new record are to appear in the output file, but showing only those fields that have changed. For deleted records, the full old record is to appear in the output file.

I would also like the following output to the screen as the files are being processed:

Deletes: D: 199, Tim
Edits:   E: 345, Joe -> John
         E: 345, work -> xxx
New:     N: 999, Ami

Thanks.


Answer 1:


I suggest you read some of the excellent introductions to pandas to understand how and why this works, and to adapt it to your specific needs.

Reading the excel-files

pandas.read_excel

import pandas as pd

filename1 = 'filename1.xlsx'
filename2 = 'filename2.xlsx'

df1 = pd.read_excel(filename1, index_col=0)
df2 = pd.read_excel(filename2, index_col=0)

df1 and df2 should now be pandas.DataFrames with ID as the index and the first row of each sheet as the column headers.

Merging the files

pandas.merge

df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False, indicator=True)
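To see what the merged frame looks like, here is a small sketch using toy DataFrames in place of the two files. The values come from the examples in the question, except Ami's Data3, which is a made-up placeholder. Columns that exist in both frames get the suffixes _x (first file) and _y (second file), and indicator=True adds a _merge column saying where each ID was found.

```python
import pandas as pd

# Toy stand-ins for the two files; only two of the data columns are shown.
df1 = pd.DataFrame(
    {"Data1": ["Tim", "Joe", "Sam"], "Data3": ["work", "work", "uni"]},
    index=pd.Index([199, 345, 356], name="ID"),
)
df2 = pd.DataFrame(
    {"Data1": ["John", "Sam", "Ami"], "Data3": ["xxx", "uni", "work"]},  # Ami's Data3 is a placeholder
    index=pd.Index([345, 356, 999], name="ID"),
)

df_merged = pd.merge(df1, df2, left_index=True, right_index=True,
                     how="outer", sort=False, indicator=True)

# Overlapping column names are suffixed, and _merge is appended last.
print(df_merged.columns.tolist())
# _merge is 'left_only', 'right_only' or 'both' for each ID.
print(df_merged["_merge"])
```

This is where the Data1_x / Data1_y names used in the next step come from.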

Selecting the changes

id_new = df_merged.index[df_merged['_merge'] == 'right_only'] 
id_deleted = df_merged.index[df_merged['_merge'] == 'left_only'] 
id_changed_data1 = df_merged.index[(df_merged['_merge'] == 'both') & (df_merged['Data1_x'] != df_merged['Data1_y'])]
id_changed_data3 = df_merged.index[(df_merged['_merge'] == 'both') & (df_merged['Data3_x'] != df_merged['Data3_y'])]

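This gives you the IDs of the new, deleted and changed records (as pandas Index objects rather than lists), which you can format as you want. Be aware that NaN != NaN evaluates to True, so a cell that is empty in both files will be reported as changed by the comparisons above.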
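As one possible way to turn those IDs into the requested output, here is a minimal sketch. It is not the only way to do it, and it works on the two frames directly instead of on df_merged. The helper name diff_frames, the output column name "Change" and Ami's row values are my own assumptions; it also assumes the IDs are unique within each file and that the first data column (Data1) is the one to show on screen for deletes and new records.

```python
import pandas as pd

def diff_frames(old, new, watched=("Data1", "Data3")):
    """Hypothetical helper: return (report, lines) for two frames indexed by unique ID."""
    watched = list(watched)
    deleted_ids = old.index.difference(new.index)
    new_ids = new.index.difference(old.index)
    common_ids = old.index.intersection(new.index)

    old_c = old.loc[common_ids, watched]
    new_c = new.loc[common_ids, watched]
    # Changed = values differ, except when both cells are missing (NaN != NaN is True).
    changed = old_c.ne(new_c) & ~(old_c.isna() & new_c.isna())
    edited_ids = common_ids[changed.any(axis=1).to_numpy()]
    mask = changed.loc[edited_ids]

    # Old and new version of each edited record, with unchanged fields blanked out.
    # The stable sort keeps the old row directly above the new row for each ID.
    edits = pd.concat([old_c.loc[edited_ids].where(mask),
                       new_c.loc[edited_ids].where(mask)]).sort_index(kind="mergesort")

    report = pd.concat([
        old.loc[deleted_ids].assign(Change="D"),
        edits.assign(Change="E"),
        new.loc[new_ids].assign(Change="N"),
    ]).rename_axis("ID").reset_index()
    report = report[["Change", "ID"] + list(old.columns)]

    # Screen output in the style shown in the question.
    name_col = old.columns[0]
    lines = [f"D: {i}, {old.at[i, name_col]}" for i in deleted_ids]
    for i in edited_ids:
        for col in watched:
            if mask.at[i, col]:
                lines.append(f"E: {i}, {old.at[i, col]} -> {new.at[i, col]}")
    lines += [f"N: {i}, {new.at[i, name_col]}" for i in new_ids]
    return report, lines

# Toy data from the question; the values in Ami's row are placeholders.
cols = ["Data1", "Data2", "Data3", "Data4"]
old = pd.DataFrame([["Tim", 55, "work", "$55"], ["Joe", 45, "work", "$34"],
                    ["Sam", 23, "uni", "$12"]],
                   index=pd.Index([199, 345, 356], name="ID"), columns=cols)
new = pd.DataFrame([["John", 45, "xxx", "$34"], ["Sam", 23, "uni", "$12"],
                    ["Ami", 30, "uni", "$0"]],
                   index=pd.Index([345, 356, 999], name="ID"), columns=cols)

report, lines = diff_frames(old, new)
print("\n".join(lines))
print(report)
```

With the real files you would pass df1 and df2 instead of the toy frames, and write the result with report.to_excel('differences.xlsx', index=False) (the file name is just an example, and writing .xlsx needs an Excel engine such as openpyxl installed).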



Source: https://stackoverflow.com/questions/45452686/compare-2-excel-files-and-output-an-excel-file-with-differences
