Question
I want to eliminate repeated rows in my dataframe.
I know that the drop_duplicates() method works for dropping rows with identical subcolumn values. However, I want to drop rows that aren't identical, just similar. For example, I have the following two rows:
Title                 Area   Price
Apartment at Boston   100    150000
Apt at Boston         105    149000
I want to be able to eliminate one of these two rows based on some similarity measure, such as when Title, Area, and Price each differ by less than 5%. Say, I could delete rows whose similarity measure is > 0.95. This would be particularly useful for large data sets, instead of manually inspecting them row by row. How can I achieve this?
Answer 1:
Here is a function using difflib. I got the similar function from here. You may also want to check out some of the answers on that page to determine the best similarity metric for your use case.
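As a quick sanity check (my own computation, not part of the original answer): SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings, so the two sample titles score about 0.81:

from difflib import SequenceMatcher

# 'Apartment at Boston' and 'Apt at Boston' share 13 characters
# ('Ap' + 't at Boston') out of 19 + 13 = 32, so ratio = 26/32 = 0.8125
print(SequenceMatcher(None, 'Apartment at Boston', 'Apt at Boston').ratio())

Keep that number in mind when choosing the ratio threshold below.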
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Title': ['Apartment at Boston', 'Apt at Boston'],
                    'Area': [100, 105],
                    'Price': [150000, 149000]})
def string_ratio(df, col, ratio):
    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

    # A row is flagged True only when every other row in the column
    # scores at or above the given ratio against it.
    ratios = []
    for i, x in enumerate(df[col]):
        a = np.array([similar(x, row) for row in df[col]])
        a = np.where(a < ratio)[0]
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)
def numeric_ratio(df, col, ratio):
    # Same idea for numeric columns: min/max of each pair gives a
    # similarity in (0, 1], and a row is flagged True only when every
    # other row in the column is within the given ratio of it.
    ratios = []
    for i, x in enumerate(df[col]):
        a = np.array([min(x, row) / max(x, row) for row in df[col]])
        a = np.where(a < ratio)[0]
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)
# Keep the rows that are NOT similar to all others in every column
mask = ~(string_ratio(df1, 'Title', .95)
         & numeric_ratio(df1, 'Area', .95)
         & numeric_ratio(df1, 'Price', .95))
df1[mask]
It should be able to weed out most of the similar data, though you might want to tweak the string_ratio function if it doesn't suit your case.
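One thing worth checking before applying this to a real data set (my own observation, verified with the code above): the two sample titles score only about 0.81, so a Title threshold of .95 flags nothing here; and because a row is flagged only when every other row in the column clears the threshold, lowering the threshold on this two-row frame flags both members of the pair:

# With the thresholds above nothing is flagged, because the titles
# only score ~0.81, so both rows survive:
print(df1[mask])

# Lowering the string threshold flags the pair -- but both members,
# so the frame comes back empty rather than keeping one row:
mask_low = ~(string_ratio(df1, 'Title', .80)
             & numeric_ratio(df1, 'Area', .95)
             & numeric_ratio(df1, 'Price', .95))
print(df1[mask_low])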
Answer 2:
See if this meets your needs:
import pandas as pd

Title = ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago',
         'Apt at Seattle', 'Apt at Seattle', 'Apt at Chicago']
Area = [100, 105, 100, 102, 101, 101]
Price = [150000, 149000, 150200, 150300, 150000, 150000]
data = dict(Title=Title, Area=Area, Price=Price)
df = pd.DataFrame(data, columns=data.keys())
The df created is shown below:
                 Title  Area   Price
0  Apartment at Boston   100  150000
1        Apt at Boston   105  149000
2       Apt at Chicago   100  150200
3       Apt at Seattle   102  150300
4       Apt at Seattle   101  150000
5       Apt at Chicago   101  150000
Now we run the code below:
from fuzzywuzzy import fuzz

def fuzzy_compare(a, b):
    return fuzz.partial_ratio(a, b)

tl = df["Title"].tolist()

def do_the_thing(i):
    # Compare row i against every later row; drop the later row when the
    # titles fuzzy-match above 80 and Area and Price are within ~5%.
    if i not in df.index:          # row i was dropped in an earlier pass
        return
    itered = i + 1
    while itered < len(tl):
        if itered in df.index:     # skip rows that were already dropped
            val = fuzzy_compare(tl[i], tl[itered])
            if val > 80:
                area_ratio = df.loc[i, 'Area'] / df.loc[itered, 'Area']
                price_ratio = df.loc[i, 'Price'] / df.loc[itered, 'Price']
                if 0.94 < area_ratio < 1.05 and 0.94 < price_ratio < 1.05:
                    df.drop(itered, inplace=True)
        itered = itered + 1

i = 0
while i < len(tl) - 1:
    do_the_thing(i)
    i = i + 1
The output df is below. The repeating Boston & Seattle items are removed when the fuzzy match is more than 80 and the values of Area & Price are within 5% of each other.
                 Title  Area   Price
0  Apartment at Boston   100  150000
2       Apt at Chicago   100  150200
3       Apt at Seattle   102  150300
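For reference, the same pair-dropping logic can be written more compactly. This is my own sketch, not the answer's code: the helper name drop_similar and the symmetric 5% tolerance (abs(x/y - 1) < 0.05, versus the 0.94-1.05 band above) are assumptions for illustration.

from itertools import combinations

from fuzzywuzzy import fuzz

def drop_similar(frame, fuzz_cutoff=80, tol=0.05):
    # Compare every pair of rows once; collect the later row of each
    # similar pair, then drop all collected rows in a single pass.
    to_drop = set()
    for i, j in combinations(frame.index, 2):
        if i in to_drop or j in to_drop:
            continue
        if (fuzz.partial_ratio(frame.at[i, 'Title'], frame.at[j, 'Title']) > fuzz_cutoff
                and abs(frame.at[i, 'Area'] / frame.at[j, 'Area'] - 1) < tol
                and abs(frame.at[i, 'Price'] / frame.at[j, 'Price'] - 1) < tol):
            to_drop.add(j)
    return frame.drop(index=to_drop)

# On the original six-row frame this keeps rows 0, 2 and 3,
# matching the output above:
print(drop_similar(df))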
Source: https://stackoverflow.com/questions/57300449/drop-dataframe-rows-based-on-a-similarity-measure-pandas