Question
I have two Excel sheets. One contains summaries and the other contains categories with potential filter words. I need to assign a category to each row of the first dataframe if its summary contains any of the filter words listed for that category in the second dataframe.
I tried expanding the filter list in the second dataframe and matching its terms against the words in the first dataframe.
Test data:
import pandas as pd
data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}
data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
   Bucket                                            Summary
0  basket     This is a basket of red apples. They are sour.
1  bushel  We found a bushel of fruit. They are red and s...
2    peck  There is a peck of pears that taste sweet. The...
3     box  We have a box of plums. They are sour and have...
print(df2)
   Category                   Filters
0     Fruit  apple, pear, plum, grape
1     Color        red, purple, green
These lines convert the Category column of the second table to a list of unique values for later use.
category_list = df2['Category'].values
category_list = list(set(category_list))
Attempt to match the text.
for item in category_list:
    item = df2.loc[df2['Category'] == item]
    filter_list = item['Filters'].values
    filter_list = list(set(filter_list))
    df1 = df1[df1['Summary'].isin(filter_list)]
I want the first dataframe to have the matching categories assigned to it, separated by commas.
Desired result:
   Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel         Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box         Fruit  We have a box of plums. They are sour and have...
I hope this is clear. I have been banging my head against it for a week now.
Thank you in advance
Answer 1:
Use pandas.Series.str.contains to check Filters with a loop:
import numpy as np
# Turn each comma-separated filter string into a list of terms
df2['Filters'] = [key.replace(' ', '') for key in df2['Filters']]
df2['Filters'] = df2['Filters'].apply(lambda x: x.split(','))
# True for each Summary row that contains at least one term of the category
Fruit = pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Fruit']]).any()
Color = pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Color']]).any()
print(Fruit)
print(Color)
0 True
1 False
2 True
3 True
dtype: bool
0 True
1 True
2 True
3 False
dtype: bool
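For context, the attempt in the question used Series.isin, which compares each whole cell against the listed values, while str.contains looks for a term anywhere inside the string. A minimal sketch of the difference, using two of the summaries from the question (the variable names summaries, exact and substring are only illustrative):

```python
import pandas as pd

summaries = pd.Series([
    'This is a basket of red apples. They are sour.',
    'We found a bushel of fruit. They are red and sweet.',
])

# isin checks whether the entire cell equals one of the values,
# so a full sentence never equals a single filter word
exact = summaries.isin(['apple', 'red'])

# str.contains checks whether the term occurs anywhere in the string
substring = summaries.str.contains('apple')

print(exact.tolist())      # [False, False]
print(substring.tolist())  # [True, False]
```

Note that str.contains treats its argument as a regular expression by default; passing regex=False makes it a literal substring test, which may be safer if the filter words can contain characters such as "." or "+".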
Then use np.where with Series.str.cat to get your dataframe output:
df1['Fruit']=np.where(Fruit,'Fruit','')
df1['Color']=np.where(Color,'Color','')
df1['Category']=df1['Fruit'].str.cat(df1['Color'],sep=', ')
df1=df1[['Bucket','Category','Summary']]
print(df1)
   Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel       , Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box       Fruit,   We have a box of plums. They are sour and have...
To generalize to n categories (starting again from the original df1 and df2, since the Filters column is split anew):
df2['Filters'] = [key.replace(' ', '') for key in df2['Filters']]
df2['Filters'] = df2['Filters'].apply(lambda x: x.split(','))
filters = df2.set_index('Category')['Filters']
# One Series per category: the category name where any of its terms matches, '' otherwise
Categories = [pd.Series(np.where(pd.DataFrame([df1['Summary'].str.contains(key) for key in filters[category_filter]]).any(), category_filter, ''))
              for category_filter in df2['Category']]
df1['Category'] = Categories[0].str.cat(Categories[1:], sep=', ')
df1 = df1.reindex(columns=['Bucket', 'Category', 'Summary'])
print(df1)
   Bucket      Category                                            Summary
0  basket  Fruit, Color     This is a basket of red apples. They are sour.
1  bushel       , Color  We found a bushel of fruit. They are red and s...
2    peck  Fruit, Color  There is a peck of pears that taste sweet. The...
3     box       Fruit,   We have a box of plums. They are sour and have...
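The outputs above still contain stray separators (", Color" and "Fruit, ") because empty strings are joined along with the matches. One possible way to avoid that is to collect only the matching category names per row and join those. The following is a minimal sketch under the assumption that plain substring matching is acceptable; the helper name matching_categories is hypothetical and not part of pandas:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Bucket': ['basket', 'bushel', 'peck', 'box'],
    'Summary': [
        'This is a basket of red apples. They are sour.',
        'We found a bushel of fruit. They are red and sweet.',
        'There is a peck of pears that taste sweet. They are very green.',
        'We have a box of plums. They are sour and have a great color.',
    ],
})
df2 = pd.DataFrame({
    'Category': ['Fruit', 'Color'],
    'Filters': ['apple, pear, plum, grape', 'red, purple, green'],
})

def matching_categories(summary, categories):
    """Hypothetical helper: names of all categories with a term found in summary."""
    hits = []
    for category, filters in zip(categories['Category'], categories['Filters']):
        terms = [term.strip() for term in filters.split(',')]
        # Plain substring test, case-sensitive
        if any(term in summary for term in terms):
            hits.append(category)
    # Joining only the hits leaves no dangling ", "
    return ', '.join(hits)

df1['Category'] = [matching_categories(s, df2) for s in df1['Summary']]
df1 = df1[['Bucket', 'Category', 'Summary']]
print(df1['Category'].tolist())
# ['Fruit, Color', 'Color', 'Fruit, Color', 'Fruit']
```

As with str.contains, this matches substrings, so a term like "red" would also match inside a longer word such as "hundred"; word-boundary matching would need a regular expression instead.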
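Answer 2:
This is my attempt using a regex pattern and the pandas string findall function. First, the filters of each category are joined with "|" and wrapped in a capture group, and the groups are in turn joined with "|" to form a single pattern. findall then returns, for every match, a tuple with one entry per group; the position of the non-empty entry identifies the matched category.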
import pandas as pd
data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}
data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
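# Wrap each category's alternatives in one capture group, then join the groups with "|"
pat = ("(" + df2.Filters.str.replace(", ", "|") + ")").str.cat(sep="|")
# For every match keep the index of the non-empty group; it equals the category's row in df2
found = df1.Summary.str.findall(pat) \
    .apply(lambda x: [i for m in x for i, k in enumerate(m) if k != ""])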
## for pandas 0.25 and above
# found= found.explode()
# for pandas below 0.25
found = found.apply(lambda x: pd.Series(x)).unstack().reset_index(level=0, drop=True).dropna()
found.name = "Cat_ID"
result = df1.merge(found, left_index=True, right_index=True) \
    .merge(df2["Category"], left_on="Cat_ID", right_index=True).drop("Cat_ID", axis=1)
result = result.groupby(result.index).agg({"Bucket":"min", "Summary": "min", "Category": lambda x: ", ".join(x)})
result
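To see why the group position identifies the category, it may help to run the pattern on a single summary with Python's re module, which Series.str.findall applies to each element. This is only an illustrative sketch; the pattern is written out by hand and the names pattern, text, matches and group_ids are not from the answer:

```python
import re

# One capture group per category: group 0 = Fruit, group 1 = Color
pattern = "(apple|pear|plum|grape)|(red|purple|green)"
text = "This is a basket of red apples. They are sour."

# With several groups, findall returns one tuple per match,
# with '' for the groups that did not take part in that match
matches = re.findall(pattern, text)
print(matches)  # [('', 'red'), ('apple', '')]

# The index of the non-empty entry is the row of the category in df2
group_ids = [i for m in matches for i, k in enumerate(m) if k != ""]
print(group_ids)  # [1, 0]
```

The ids come out in the order the words appear in the text, so the order of the joined category names can differ from the order of the rows in df2.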
Source: https://stackoverflow.com/questions/58205939/how-do-i-assign-categories-in-a-dataframe-if-they-contain-any-element-from-anoth