问题
I am trying to compare two csv files using pandas dataframes. One is a master sheet that is going to have data appended to it daily (test_master.csv). The second is a daily report (test_daily.csv) that contains the data I want to append to the test_master.csv.
I am creating two pandas dataframes from these files:
import pandas as pd
dfmaster = pd.read_csv(test_master.csv)
dfdaily = pd.read_csv(test_daily.csv)
I want the daily list to get compared to the master list to see if there are any duplicate rows on the daily list that are already in the master list. If so, I want them to remove the duplicates from dfdaily. I then want to write this non-duplicate data to dfmaster.
The duplicate data will always be an entire row. My plan was to iterate through the sheets row by row to make the comparison, then.
I realize I could append my daily data to the dfmaster dataframe and use drop_duplicates to remove the duplicates. I cannot figure out how to remove the duplicates in the dfdaily dataframe, though. And I need to be able to write the dfdaily data back to test_daily.csv (or another new file) without the duplicate data.
Here is an example of what the dataframes could look like.
test_master.csv
column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2
test_daily.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
Desired output is:
test_master.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
test_daily.csv
+----------+----------+----------+
| column 1 | column 2 | column 3 |
+----------+----------+----------+
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+----------+----------+----------+
Any help would be greatly appreciated!
EDIT
I incorrectly thought solutions from the set difference question solved my problem. I ran into certain cases where those solutions did not work. I believe it had something to do with index numbers labels as mentioned in a comment by Troy D below. Troy D's solution is the solution that I am now using.
回答1:
Try this:
I crate 2 indexes, and then set rows 2-4 to be duplicates:
import numpy as np
test_master = pd.DataFrame(np.random.rand(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
test_daily.iloc[1:4] = test_master[:3].values
print(test_master)
print(test_daily)
output:
A B C
0 0.009322 0.330057 0.082956
1 0.197500 0.010593 0.356774
2 0.147410 0.697779 0.421207
A B C
0 0.643062 0.335643 0.215443
1 0.009322 0.330057 0.082956
2 0.197500 0.010593 0.356774
3 0.147410 0.697779 0.421207
4 0.973867 0.873358 0.502973
Then, add a multiindex level to identify which data is from which dataframe:
test_master['master'] = 'master'
test_master.set_index('master', append=True, inplace=True)
test_daily['daily'] = 'daily'
test_daily.set_index('daily', append=True, inplace=True)
Now merge as you suggested and drop duplicates:
merged = test_master.append(test_daily)
merged = merged.drop_duplicates().sort_index()
print(merged)
output:
A B C
master
0 daily 0.643062 0.335643 0.215443
master 0.009322 0.330057 0.082956
1 master 0.197500 0.010593 0.356774
2 master 0.147410 0.697779 0.421207
4 daily 0.973867 0.873358 0.502973
There you see the combined dataframe with the origin of the data labeled in the index. Now just slice for the daily data:
idx = pd.IndexSlice
print(merged.loc[idx[:, 'daily'], :])
output:
A B C
master
0 daily 0.643062 0.335643 0.215443
4 daily 0.973867 0.873358 0.502973
来源:https://stackoverflow.com/questions/50103918/how-to-compare-two-pandas-dataframes-and-remove-duplicates-on-one-file-without-a