How to compare two pandas dataframes and remove duplicates on one file without appending data from other file [duplicate]

Submitted by 孤者浪人 on 2020-05-15 09:47:26

Question


I am trying to compare two csv files using pandas dataframes. One is a master sheet that is going to have data appended to it daily (test_master.csv). The second is a daily report (test_daily.csv) that contains the data I want to append to the test_master.csv.

I am creating two pandas dataframes from these files:

import pandas as pd

dfmaster = pd.read_csv('test_master.csv')
dfdaily = pd.read_csv('test_daily.csv')

I want the daily list compared against the master list to see whether any rows on the daily list are already in the master list. If so, I want to remove those duplicates from dfdaily and then append only the non-duplicate data to dfmaster.

The duplicate data will always be an entire row. My plan was to iterate through the sheets row by row to make the comparison.

I realize I could append my daily data to the dfmaster dataframe and use drop_duplicates to remove the duplicates. I cannot figure out how to remove the duplicates in the dfdaily dataframe, though. And I need to be able to write the dfdaily data back to test_daily.csv (or another new file) without the duplicate data.
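
For example, this gets me the updated master (just a rough sketch of that idea, using pd.concat in place of append), but it leaves dfdaily and test_daily.csv untouched:

combined = pd.concat([dfmaster, dfdaily]).drop_duplicates()
combined.to_csv('test_master.csv', index=False)
# dfdaily is unchanged here, so test_daily.csv still contains the
# rows that already exist in the master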

Here is an example of what the dataframes could look like.

test_master.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| 1           | 2           | 3           |
| 4           | 5           | 6           |
| 7           | 8           | 9           |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
+-------------+-------------+-------------+

test_daily.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10          | 11          | 12          |
| 13          | 14          | 15          |
+-------------+-------------+-------------+

Desired output is:

test_master.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| 1           | 2           | 3           |
| 4           | 5           | 6           |
| 7           | 8           | 9           |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10          | 11          | 12          |
| 13          | 14          | 15          |
+-------------+-------------+-------------+

test_daily.csv

+----------+----------+----------+
| column 1 | column 2 | column 3 |
+----------+----------+----------+
|       10 |       11 |       12 |
|       13 |       14 |       15 |
+----------+----------+----------+

Any help would be greatly appreciated!

EDIT

I incorrectly thought the solutions from the set difference question solved my problem. I ran into cases where those solutions did not work; I believe it had something to do with index labels, as mentioned in a comment by Troy D below. Troy D's solution is the one I am now using.
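
A value-only comparison that ignores index labels entirely, such as an anti-join with merge and indicator=True, is another option (a rough sketch, assuming both files share identical column names):

merged = dfdaily.merge(dfmaster, how='left', indicator=True)
# keep only the daily rows that found no matching row in the master
dfdaily_new = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
dfdaily_new.to_csv('test_daily.csv', index=False)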


Answer 1:


Try this:

First I create two example DataFrames, then set rows 2-4 of the daily frame to be duplicates of the master rows:

import numpy as np
import pandas as pd

test_master = pd.DataFrame(np.random.rand(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
# make rows 2-4 of the daily frame exact copies of the three master rows
test_daily.iloc[1:4] = test_master[:3].values

print(test_master)
print(test_daily)

output:

      A         B         C
0  0.009322  0.330057  0.082956
1  0.197500  0.010593  0.356774
2  0.147410  0.697779  0.421207
      A         B         C
0  0.643062  0.335643  0.215443
1  0.009322  0.330057  0.082956
2  0.197500  0.010593  0.356774
3  0.147410  0.697779  0.421207
4  0.973867  0.873358  0.502973

Then, add a multiindex level to identify which data is from which dataframe:

test_master['master'] = 'master'
test_master.set_index('master', append=True, inplace=True)
test_daily['daily'] = 'daily'
test_daily.set_index('daily', append=True, inplace=True)

Now combine the two frames, as you suggested, and drop duplicates:

# DataFrame.append has since been removed from pandas; pd.concat is the equivalent
merged = pd.concat([test_master, test_daily])
merged = merged.drop_duplicates().sort_index()
print(merged)

output:

             A         B         C
  master                              
0 daily   0.643062  0.335643  0.215443
  master  0.009322  0.330057  0.082956
1 master  0.197500  0.010593  0.356774
2 master  0.147410  0.697779  0.421207
4 daily   0.973867  0.873358  0.502973

There you see the combined dataframe with the origin of the data labeled in the index. Now just slice for the daily data:

idx = pd.IndexSlice
print(merged.loc[idx[:, 'daily'], :])

output:

             A         B         C
  master                              
0 daily   0.643062  0.335643  0.215443
4 daily   0.973867  0.873358  0.502973
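
To get both results back into plain CSV files from here, something like the following should work (a sketch; droplevel removes the added 'master'/'daily' index level):

daily_only = merged.loc[idx[:, 'daily'], :].droplevel(-1)
daily_only.to_csv('test_daily.csv', index=False)

# the full merged frame (already de-duplicated) becomes the new master
merged.droplevel(-1).to_csv('test_master.csv', index=False)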


Source: https://stackoverflow.com/questions/50103918/how-to-compare-two-pandas-dataframes-and-remove-duplicates-on-one-file-without-a
