Removing repetitive/duplicate occurance in excel using python

此生再无相见时 提交于 2021-02-05 06:35:08

问题


I am trying to remove the repetitive/duplicate Names which is coming under NAME column. I just want to keep the 1st occurrence from the repetitive/duplicate names by using python script.

This is my input excel:

And need output like this:


回答1:


This isn't removing duplicates per say you're just filling duplicate keys in one column as blanks, I would handle this as follows :

by creating a mask where you return a true/false boolean if the row is == the row above.

assuming your dataframe is called df

mask = df['NAME'].ne(df['NAME'].shift())

df.loc[~mask,'NAME'] = ''

explanation :

what we are doing above is the following,

first selecting a single column, or in pandas terminology a series, we then apply a .ne (not equal to) which in effect is !=

lets see this in action.

import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','foruth','fifth','fourth','third','fifth','fourth']

now, lets create a dataframe similar to yours.

df = pd.DataFrame({'NAME' : names,
         'DEFAULT' : defaults,
         'CLASS' : classes,
         'AGE' : [np.random.randint(1,5) for len in names], 
         'GROUP' : [np.random.randint(1,5) for len in names]}) # being lazy with your age and group variables. 

so, if we did df['NAME'].ne('Omar') which is the same as [df['NAME'] != 'Omar'] we would get.

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True

so, with that out of the way, we want to see if the name in row 1 (remember python is a 0 index language so row 1 is actually the 2nd physical row) is .eq to the row above.

we do this by calling [.shift][2] hyperlinked for more info.

what this basically does is shift the rows by its index with a defined variable number, lets call this n.

if we called df['NAME'].shift(1)

0       NaN
1     Rekha
2     Rekha
3      Jaya
4      Jaya
5    Sushma
6      Nita
7      Nita

we can see here that that Rekha has moved down

so putting that all together,

df['NAME'].ne(df['NAME'].shift())
0     True
1    False
2     True
3    False
4     True
5     True
6    False
7    False

we assign this to a self defined variable called mask you could call this whatever you want.

we then use [.loc][2] which lets you access your dataframe by labels or a boolean array, in this instance an array.

however, we only want to access the booleans which are False so we use a ~ which inverts the logic of our array.

    NAME    DEFAULT CLASS   AGE GROUP
1   Rekha       third   1   4
3   Jaya        fifth   1   1
6   Nita        fifth   1   2
7   Nita        fourth  1   4

all we need to do now is change these rows to blanks as your initial requirment, and we are left with.

    NAME    DEFAULT   CLASS AGE GROUP
0   Rekha             forth 2   2
1                     third 1   4
2   Jaya    c-default forth 3   3
3                     fifth 1   1
4   Sushma            fourth3   1
5   Nita    c-default third 4   2
6                     fifth 1   2
7                     fourth1   4

hope that helps!



来源:https://stackoverflow.com/questions/56915670/removing-repetitive-duplicate-occurance-in-excel-using-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!