Question

I have two Excel sheets. One contains summaries and the other contains categories with potential filter words. I need to assign a category to each row of the first dataframe if its summary matches any filter word for that category in the second dataframe.

I have attempted to expand the list in the second dataframe and map by matching the terms to any words in the first dataframe.

Test data:

import pandas as pd

data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}

data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}

df1 = pd.DataFrame(data1)

df2 = pd.DataFrame(data2)

print(df1)

   Bucket                                            Summary
0  basket     This is a basket of red apples. They are sour.
1  bushel  We found a bushel of fruit. They are red and s...
2    peck  There is a peck of pears that taste sweet. The...
3     box  We have a box of plums. They are sour and have...

print(df2)

  Category                   Filters
0    Fruit  apple, pear, plum, grape
1    Color        red, purple, green

These lines convert the Category column of the second table into a list of unique values to use later.

category_list =  df2['Category'].values

category_list = list(set(category_list))

My attempt to match the text:

for item in category_list:

    item = df2.loc[df2['Category'] == item]

    filter_list =  item['Filters'].values

    filter_list = list(set(filter_list))

    df1 = df1 [df1 ['Summary'].isin(filter_list)]

I want the first dataframe to have the matching categories assigned to it, separated by commas.

Desired result:

   Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel         Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box         Fruit  We have a box of plums. They are sour and have...

I hope this is clear. I have been banging my head against it for a week now.

Thank you in advance

Answer 1:

Use pandas.Series.str.contains to check Filters with a loop:

df2['Filters']=[key.replace(' ','') for key in df2['Filters']]
df2['Filters']=df2['Filters'].apply(lambda x : x.split(','))
Fruit=pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Fruit']]).any()
Color=pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Color']]).any()
print(Fruit)
print(Color)

0     True
1    False
2     True
3     True
dtype: bool 

0     True
1     True
2     True
3    False
dtype: bool

Then use np.where with Series.str.cat to get your dataframe output:

import numpy as np

df1['Fruit']=np.where(Fruit,'Fruit','')
df1['Color']=np.where(Color,'Color','')
df1['Category']=df1['Fruit'].str.cat(df1['Color'],sep=', ')
df1=df1[['Bucket','Category','Summary']]
print(df1)

   Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel       , Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box       Fruit,   We have a box of plums. They are sour and have...
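The output above still carries a stray separator wherever one of the labels is empty (", Color" and "Fruit, "). A minimal sketch of one way to tidy that up, assuming the two boolean masks printed earlier; the names fruit_label and color_label are only illustrative. Stripping handles the ends of the string only, so with three or more categories an empty label in the middle would still leave a doubled separator.

```python
import numpy as np
import pandas as pd

# Per-category boolean masks, copied from the output printed earlier in this answer.
fruit = pd.Series([True, False, True, True])
color = pd.Series([True, True, True, False])

# Wrap np.where results in Series so that .str.cat is available.
fruit_label = pd.Series(np.where(fruit, 'Fruit', ''))
color_label = pd.Series(np.where(color, 'Color', ''))

# Concatenate, then strip any leading/trailing commas and spaces
# left behind by an empty label.
category = fruit_label.str.cat(color_label, sep=', ').str.strip(', ')
print(category.tolist())
```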

To generalize to n category filters:

df2['Filters']=[key.replace(' ','') for key in df2['Filters']]
df2['Filters']=df2['Filters'].apply(lambda x : x.split(','))
Categories=[
    pd.Series(np.where(
        pd.DataFrame([df1['Summary'].str.contains(key)
                      for key in df2.set_index('Category')['Filters'][category_filter]]).any(),
        category_filter, ''))
    for category_filter in df2['Category']
]
df1['Category']=Categories[0].str.cat(Categories[1:],sep=', ')
df1=df1.reindex(columns=['Bucket','Category','Summary'])
print(df1)

   Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel       , Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box       Fruit,   We have a box of plums. They are sour and have...
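The n-category version has the same stray-separator issue. As a rough alternative sketch (not the code above), one could compute one boolean column per category and join only the names that matched. The helper name assign_categories is hypothetical, and it assumes every category has at least one filter term.

```python
import re
import pandas as pd

def assign_categories(summaries, categories):
    """Return a Series of comma-separated category names matched in each summary.

    `summaries` is a Series of text; `categories` is a DataFrame with
    'Category' and 'Filters' columns (filters as one comma-separated string).
    """
    matched = {}
    for name, filters in zip(categories['Category'], categories['Filters']):
        # Split "apple, pear" into terms and escape them for use in a regex.
        terms = [re.escape(t.strip()) for t in filters.split(',') if t.strip()]
        # Substring match, so 'apple' also matches 'apples'.
        matched[name] = summaries.str.contains('|'.join(terms), case=False, regex=True)
    flags = pd.DataFrame(matched)
    # For each row, keep only the names of the categories whose flag is True.
    return flags.apply(
        lambda row: ', '.join(name for name, hit in row.items() if hit), axis=1
    )

df1 = pd.DataFrame({
    'Bucket': ['basket', 'bushel', 'peck', 'box'],
    'Summary': [
        'This is a basket of red apples. They are sour.',
        'We found a bushel of fruit. They are red and sweet.',
        'There is a peck of pears that taste sweet. They are very green.',
        'We have a box of plums. They are sour and have a great color.',
    ],
})
df2 = pd.DataFrame({
    'Category': ['Fruit', 'Color'],
    'Filters': ['apple, pear, plum, grape', 'red, purple, green'],
})

df1['Category'] = assign_categories(df1['Summary'], df2)
print(df1[['Bucket', 'Category']])
```

Because this is plain substring matching, a short term such as "red" would also match inside longer words; adding word boundaries to the pattern is one option if that matters for your data.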

Answer 2:

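This is my attempt using a regex pattern and the pandas string methods str.replace and str.findall. First, each category's filters are joined with "|" and wrapped in a capture group, and the groups are themselves joined with "|" to form a single regex pattern. str.findall then returns one tuple per match, with one slot per group; the position of the non-empty slot identifies the matched category.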
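To make the tuple-per-match behaviour concrete, here is a small sketch using the standard re module, whose findall semantics pandas' str.findall follows. The pattern is written out by hand for the two categories in the question.

```python
import re

# One capture group per category, joined with "|" (hand-written here for clarity).
pat = "(apple|pear|plum|grape)|(red|purple|green)"

matches = re.findall(pat, "This is a basket of red apples. They are sour.")
print(matches)  # [('', 'red'), ('apple', '')]

# The index of the non-empty slot in each tuple is the position of the category.
category_ids = [i for m in matches for i, k in enumerate(m) if k != ""]
print(category_ids)  # [1, 0]
```

Note that the ids come out in order of appearance in the text and can repeat if several filter words from the same category occur.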

import pandas as pd

data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}

data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}

df1 = pd.DataFrame(data1)

df2 = pd.DataFrame(data2)

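# Wrap each category's alternatives in its own capture group explicitly, then join
# the groups with "|"; this avoids depending on regex-based str.replace behavior.
pat = ("(" + df2.Filters.str.replace(", ", "|", regex=False) + ")").str.cat(sep="|")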

found = df1.Summary.str.findall(pat) \
 .apply(lambda x: [i for m in x for i, k in enumerate(m) if k!=""])

## for pandas 0.25 and above
# found= found.explode()

# for pandas below 0.25
found = found.apply(lambda x: pd.Series(x)).unstack().reset_index(level=0, drop=True).dropna()

found.name = "Cat_ID"

result = df1.merge(found, left_index=True, right_index=True) \
    .merge(df2["Category"], left_on="Cat_ID", right_index=True).drop("Cat_ID", axis=1)


result = result.groupby(result.index).agg({"Bucket":"min", "Summary": "min", "Category": lambda x: ", ".join(x)})

result
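For comparison, a more compact sketch of the same capture-group idea that skips the merges: collect the set of matched group indices per row, sort them so the names follow the order of df2, and join. It assumes at least two categories (with a single group, findall returns plain strings rather than tuples) and that df2 has a default 0..n-1 index.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Bucket': ['basket', 'bushel', 'peck', 'box'],
    'Summary': [
        'This is a basket of red apples. They are sour.',
        'We found a bushel of fruit. They are red and sweet.',
        'There is a peck of pears that taste sweet. They are very green.',
        'We have a box of plums. They are sour and have a great color.',
    ],
})
df2 = pd.DataFrame({
    'Category': ['Fruit', 'Color'],
    'Filters': ['apple, pear, plum, grape', 'red, purple, green'],
})

# One capture group per category: "(apple|pear|plum|grape)|(red|purple|green)".
pat = ("(" + df2["Filters"].str.replace(", ", "|", regex=False) + ")").str.cat(sep="|")

# For each summary, the sorted set of group indices that matched at least once.
ids = df1["Summary"].str.findall(pat).apply(
    lambda ms: sorted({i for m in ms for i, k in enumerate(m) if k != ""})
)

# Translate group indices back to category names and join them.
df1["Category"] = ids.apply(lambda xs: ", ".join(df2["Category"].iloc[i] for i in xs))
print(df1[["Bucket", "Category", "Summary"]])
```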

Source: https://stackoverflow.com/questions/58205939/how-do-i-assign-categories-in-a-dataframe-if-they-contain-any-element-from-anoth

Tags

python