Find common values within a column containing list of items

Question

I have a dataset in which a few columns contain lists of items; an example is given below. I am trying to find the entries whose lists match 100%. I would also like to find the ones that match 90% or lower.

>>> df2 = pd.DataFrame({ 'ID':['1', '2', '3', '4', '5', '6', '7', '8'], 'Productdetailed': [['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'], ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'], ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],['Apple', 'Mango'], ['Pen', 'Phone', 'Watch']]})

>>> df2
  ID                   Productdetailed
0  1               [Phone, Watch, Pen]
1  2            [Pencil, fork, Eraser]
2  3            [Apple, Mango, Orange]
3  4  [Something, Nothing, Everything]
4  5            [Eraser, fork, Pencil]
5  6               [Phone, Watch, Pen]
6  7                    [Apple, Mango]
7  8               [Pen, Phone, Watch]

Notice that index 0 and index 7 in df2 have the same set of items but in a different order, whereas index 0 and index 5 have the same items in the same order. I would like to consider both cases a match. I tried groupby and Series.isin(). I also tried an intersection after splitting the dataset in two, but it fails with a type error.

First, I would like to count the exactly matched entries (the number of matched rows will do), along with the row index numbers each one matched. When entries match only partially, like index 2 and index 6 in df2, I would like to report the percentage of items that matched and the row numbers they matched against.

As mentioned, I tried splitting the data into two parts on a specific column value and then applied:

df2['Intersection'] = [
    list(set(a).intersection(set(b)))
    for a, b in zip(df2_part1.Productdetailed, df2_part2.Productdetailed)
]

where a and b are the Productdetailed values from the two pieces, df2_part1 and df2_part2.

Is there a way to do this? Please help

Answer 1:

This solution solves the exact-match task. The nested loop compares every pair of rows, so the cost grows quadratically with the number of rows and it is not recommended for large data:

import numpy as np

# Helper column: each list sorted, so item order no longer matters
df2['dummy'] = df2['Productdetailed'].apply(sorted)
# Matching column: index of the first row with the same items (NaN if none)
df2['Matching'] = np.nan

# Compare every pair of rows (index1 < index2) and label exact matches
for index1, lst1 in enumerate(df2['dummy']):
    for index2, lst2 in enumerate(df2['dummy']):
        if index1 < index2 and lst1 == lst2:
            # Skip rows already assigned to an earlier group
            if np.isnan(df2.loc[index2, 'Matching']):
                df2.loc[index1, 'Matching'] = index1
                df2.loc[index2, 'Matching'] = index1

# Number of rows that have at least one exact match
print(df2['Matching'].notnull().sum())
5

#Deleting the dummy column
del df2['dummy']

#Final Dataframe
print(df2)

  ID                   Productdetailed  Matching
0  1               [Phone, Watch, Pen]       0.0
1  2            [Pencil, fork, Eraser]       1.0
2  3            [Apple, Mango, Orange]       NaN
3  4  [Something, Nothing, Everything]       NaN
4  5            [Eraser, fork, Pencil]       1.0
5  6               [Phone, Watch, Pen]       0.0
6  7                    [Apple, Mango]       NaN
7  8               [Pen, Phone, Watch]       0.0
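
As a possible alternative to the nested loop, the same exact-match labels can be built in a single pass by turning each list into an order-independent key and collecting the row indices per key. This is only a minimal sketch and not part of the original answer: the names key, groups, matched_groups and exact_match_count are illustrative, and df2 is rebuilt from the question so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'ID': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Productdetailed': [
        ['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'],
        ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'],
        ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],
        ['Apple', 'Mango'], ['Pen', 'Phone', 'Watch'],
    ],
})

# A sorted tuple is hashable and ignores the original item order
key = df2['Productdetailed'].apply(lambda items: tuple(sorted(items)))

# Collect the row indices that share each key
groups = {}
for idx, k in key.items():
    groups.setdefault(k, []).append(idx)

# Label each row with the first index of its group; rows with no duplicate get NaN
df2['Matching'] = key.map(
    lambda k: groups[k][0] if len(groups[k]) > 1 else np.nan
)

# Groups of rows that match each other exactly, and how many rows are involved
matched_groups = [rows for rows in groups.values() if len(rows) > 1]
exact_match_count = int(df2['Matching'].notna().sum())

print(matched_groups)
print(exact_match_count)
```

On the example data this should report the groups [0, 5, 7] and [1, 4] and a count of 5, which agrees with the Matching column above. Each row is visited once, so it avoids the pairwise comparison.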

To handle both full and partial matches, use the following (here a pair counts as partially matched if at least 2 values match; this threshold can be changed):

# Helper column: each list sorted, so item order no longer matters
df2['dummy'] = df2['Productdetailed'].apply(sorted)
# Matching column: index of the first matched row (NaN if none)
df2['Matching'] = np.nan
# Status column: describes the kind of match
df2['Status'] = 'No Match'

# Compare every pair of rows and record full and partial matches
for index1, lst1 in enumerate(df2['dummy']):
    for index2, lst2 in enumerate(df2['dummy']):
        if index1 < index2:
            if lst1 == lst2:
                if np.isnan(df2.loc[index2, 'Matching']):
                    df2.loc[index1, 'Matching'] = index1
                    df2.loc[index2, 'Matching'] = index1
                    df2.loc[[index1, index2], 'Status'] = 'Fully Matched'
            else:
                # Count the values the two lists have in common
                count = sum(1 for v1 in lst1 for v2 in lst2 if v1 == v2)
                if count >= 2:
                    if np.isnan(df2.loc[index2, 'Matching']):
                        df2.loc[index1, 'Matching'] = index1
                        df2.loc[index2, 'Matching'] = index1
                        df2.loc[[index1, index2], 'Status'] = 'Partially Matched'

# Number of rows that have a full or partial match
print(df2['Matching'].notnull().sum())

7

#Deleting the dummy column
del df2['dummy']

#Final Dataframe
print(df2)

  ID                   Productdetailed  Matching             Status
0  1               [Phone, Watch, Pen]       0.0      Fully Matched
1  2            [Pencil, fork, Eraser]       1.0      Fully Matched
2  3            [Apple, Mango, Orange]       2.0  Partially Matched
3  4  [Something, Nothing, Everything]       NaN           No Match
4  5            [Eraser, fork, Pencil]       1.0      Fully Matched
5  6               [Phone, Watch, Pen]       0.0      Fully Matched
6  7                    [Apple, Mango]       2.0  Partially Matched
7  8               [Pen, Phone, Watch]       0.0      Fully Matched
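
The question also asks for the percentage of items that matched and the rows involved. One way to sketch that, assuming "percent matched" means the number of shared items divided by the size of the larger of the two lists (the question does not define it, so this is an assumption), is to compare the rows pairwise as sets. The names row_sets, records and matches are illustrative, and df2 is rebuilt from the question so the snippet is self-contained.

```python
from itertools import combinations

import pandas as pd

df2 = pd.DataFrame({
    'ID': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Productdetailed': [
        ['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'],
        ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'],
        ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],
        ['Apple', 'Mango'], ['Pen', 'Phone', 'Watch'],
    ],
})

# Pair each row index with the set of its items (sets ignore order)
row_sets = list(zip(df2.index, map(set, df2['Productdetailed'])))

records = []
for (i, a), (j, b) in combinations(row_sets, 2):
    shared = a & b
    if shared:
        # Assumed definition: shared items relative to the larger list
        percent = 100 * len(shared) / max(len(a), len(b))
        records.append({
            'row_a': i,
            'row_b': j,
            'shared': sorted(shared),
            'percent': round(percent, 1),
        })

matches = pd.DataFrame(records)
print(matches)
```

With this definition, rows 2 and 6 come out at 66.7 percent and the order-insensitive duplicates at 100 percent. Filtering matches on the percent column then gives either the exact matches or everything below a chosen threshold such as 90. Like the loop above, this still compares every pair of rows.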

Answer 2:

To find the exact matches:

# Sort the items inside each list so that item order does not matter
df2["Productdetailed"] = df2["Productdetailed"].apply(sorted)
# Create a new column from the sorted list; a string is easier to use in a pivot table
df2['Productdetailed_str'] = df2['Productdetailed'].apply(lambda x: ', '.join(x))
df2["hit"] = 1
df3 = (df2.pivot_table(index=["Productdetailed_str"],
                       values=["ID", "hit"],
                       aggfunc={'ID': lambda x: ', '.join(x), 'hit': 'sum'}
                       ))

hit is the number of occurrences. Resulting df3:

                                     ID  hit
Productdetailed_str
Apple, Mango                          7    1
Apple, Mango, Orange                  3    1
Eraser, Pencil, fork               2, 5    2
Everything, Nothing, Something        4    1
Pen, Phone, Watch               1, 6, 8    3

The partial match is more difficult, but you can start by splitting the lists into one row per item and then work with a pivot table:

test = (df2.apply(lambda x: pd.Series(x['Productdetailed']), axis=1)
           .stack()
           .reset_index(level=1, drop=True)
           .to_frame(name='list')
           .join(df2))

If you run this, test has one row per item: the "list" column holds a single word from the "Productdetailed" list, alongside the ID. From there, I think a pivot table can extract the overlap information.
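
One way this pivot idea might be carried through, offered only as a sketch and not as the answerer's own method: build an ID-by-item indicator table with pd.crosstab and multiply it by its transpose, which gives the number of shared items for every pair of IDs. The names long, indicator and overlap are illustrative, df2 is rebuilt from the question, and DataFrame.explode is used in place of the apply/stack chain. It assumes no item is repeated within a single list.

```python
import pandas as pd

df2 = pd.DataFrame({
    'ID': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Productdetailed': [
        ['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'],
        ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'],
        ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],
        ['Apple', 'Mango'], ['Pen', 'Phone', 'Watch'],
    ],
})

# One row per (ID, item); reset the index so row labels are unique again
long = (df2.explode('Productdetailed')
           .rename(columns={'Productdetailed': 'item'})
           .reset_index(drop=True))

# Rows are IDs, columns are items, cells are 1 if the ID has the item
indicator = pd.crosstab(long['ID'], long['item'])

# Entry (a, b) is the number of items that IDs a and b have in common;
# the diagonal holds each ID's own item count
overlap = indicator.dot(indicator.T)

print(overlap)
```

For example, overlap.loc['3', '7'] is 2 (Apple and Mango), and overlap.loc['1', '8'] is 3 because those two lists differ only in order. Dividing by the diagonal would turn the counts into percentages.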

Source: https://stackoverflow.com/questions/52590439/find-common-values-within-a-column-containing-list-of-items

Tags

python

pandas

count

string-matching