Question
I want to eliminate repeated rows in my dataframe.
I know that the drop_duplicates() method works for dropping rows with identical subcolumn values. However, I want to drop rows that aren't identical, just similar. For example, I have the following two rows:
Title                 Area   Price
Apartment at Boston   100    150000
Apt at Boston         105    149000
I want to be able to eliminate one of these two rows based on some similarity measure, such as when Title, Area, and Price each differ by less than 5%. Say, I could delete rows whose similarity measure is > 0.95. This would be particularly useful for large data sets, instead of manually inspecting them row by row. How can I achieve this?
Answer 1:
Here is a function using difflib. I got the similar function from here. You may also want to check out some of the answers on that page to determine the best similarity metric for your use case.
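As a quick sanity check (my own computation, not part of the original answer): SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings, so the two sample titles score about 0.81:

from difflib import SequenceMatcher

# 'Apartment at Boston' and 'Apt at Boston' share 13 characters
# ('Ap' + 't at Boston') out of 19 + 13 = 32, so ratio = 26/32 = 0.8125
print(SequenceMatcher(None, 'Apartment at Boston', 'Apt at Boston').ratio())

Keep that number in mind when choosing the ratio threshold below.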
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Title': ['Apartment at Boston', 'Apt at Boston'],
                    'Area': [100, 105],
                    'Price': [150000, 149000]})
def string_ratio(df, col, ratio):
    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

    # A row is flagged True only when every other row in the column
    # scores at or above the given ratio against it.
    ratios = []
    for i, x in enumerate(df[col]):
        a = np.array([similar(x, row) for row in df[col]])
        a = np.where(a < ratio)[0]
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)
def numeric_ratio(df, col, ratio):
    # Same idea for numeric columns: min/max of each pair gives a
    # similarity in (0, 1], and a row is flagged True only when every
    # other row in the column is within the given ratio of it.
    ratios = []
    for i, x in enumerate(df[col]):
        a = np.array([min(x, row) / max(x, row) for row in df[col]])
        a = np.where(a < ratio)[0]
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)
# Keep the rows that are NOT similar to all others in every column
mask = ~(string_ratio(df1, 'Title', .95)
         & numeric_ratio(df1, 'Area', .95)
         & numeric_ratio(df1, 'Price', .95))
df1[mask]
It should be able to weed out most of the similar data, though you might want to tweak the string_ratio function if it doesn't suit your case.
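One thing worth checking before applying this to a real data set (my own observation, verified with the code above): the two sample titles score only about 0.81, so a Title threshold of .95 flags nothing here; and because a row is flagged only when every other row in the column clears the threshold, lowering the threshold on this two-row frame flags both members of the pair:

# With the thresholds above nothing is flagged, because the titles
# only score ~0.81, so both rows survive:
print(df1[mask])

# Lowering the string threshold flags the pair -- but both members,
# so the frame comes back empty rather than keeping one row:
mask_low = ~(string_ratio(df1, 'Title', .80)
             & numeric_ratio(df1, 'Area', .95)
             & numeric_ratio(df1, 'Price', .95))
print(df1[mask_low])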
Answer 2:
See if this meets your needs:
import pandas as pd

Title = ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago',
         'Apt at Seattle', 'Apt at Seattle', 'Apt at Chicago']
Area = [100, 105, 100, 102, 101, 101]
Price = [150000, 149000, 150200, 150300, 150000, 150000]
data = dict(Title=Title, Area=Area, Price=Price)
df = pd.DataFrame(data, columns=data.keys())
The df created is shown below:
                 Title  Area   Price
0  Apartment at Boston   100  150000
1        Apt at Boston   105  149000
2       Apt at Chicago   100  150200
3       Apt at Seattle   102  150300
4       Apt at Seattle   101  150000
5       Apt at Chicago   101  150000
Now we run the code below:
from fuzzywuzzy import fuzz

def fuzzy_compare(a, b):
    return fuzz.partial_ratio(a, b)

tl = df["Title"].tolist()

def do_the_thing(i):
    # Compare row i against every later row; drop the later row when the
    # titles fuzzy-match above 80 and Area and Price are within ~5%.
    if i not in df.index:          # row i was dropped in an earlier pass
        return
    itered = i + 1
    while itered < len(tl):
        if itered in df.index:     # skip rows that were already dropped
            val = fuzzy_compare(tl[i], tl[itered])
            if val > 80:
                area_ratio = df.loc[i, 'Area'] / df.loc[itered, 'Area']
                price_ratio = df.loc[i, 'Price'] / df.loc[itered, 'Price']
                if 0.94 < area_ratio < 1.05 and 0.94 < price_ratio < 1.05:
                    df.drop(itered, inplace=True)
        itered = itered + 1

i = 0
while i < len(tl) - 1:
    do_the_thing(i)
    i = i + 1
The output df is below. The repeating Boston & Seattle items are removed when the fuzzy match is more than 80 and the values of Area & Price are within 5% of each other.
                 Title  Area   Price
0  Apartment at Boston   100  150000
2       Apt at Chicago   100  150200
3       Apt at Seattle   102  150300
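For reference, the same pair-dropping logic can be written more compactly. This is my own sketch, not the answer's code: the helper name drop_similar and the symmetric 5% tolerance (abs(x/y - 1) < 0.05, versus the 0.94-1.05 band above) are assumptions for illustration.

from itertools import combinations

from fuzzywuzzy import fuzz

def drop_similar(frame, fuzz_cutoff=80, tol=0.05):
    # Compare every pair of rows once; collect the later row of each
    # similar pair, then drop all collected rows in a single pass.
    to_drop = set()
    for i, j in combinations(frame.index, 2):
        if i in to_drop or j in to_drop:
            continue
        if (fuzz.partial_ratio(frame.at[i, 'Title'], frame.at[j, 'Title']) > fuzz_cutoff
                and abs(frame.at[i, 'Area'] / frame.at[j, 'Area'] - 1) < tol
                and abs(frame.at[i, 'Price'] / frame.at[j, 'Price'] - 1) < tol):
            to_drop.add(j)
    return frame.drop(index=to_drop)

# On the original six-row frame this keeps rows 0, 2 and 3,
# matching the output above:
print(drop_similar(df))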
Source: https://stackoverflow.com/questions/57300449/drop-dataframe-rows-based-on-a-similarity-measure-pandas