Question
I have a dictionary:
'Consulting': {'Deloitte', 'EY', 'KPMG', 'PwC'},
'Education': {'.edu', 'College', 'University'},
'Government':{'state','.gov','city'},
'Corporate':{'corpor','consumer','care'},
...... etc.
I have a dataframe:
Sno Text column1 column2 ......
1 Deloitte.com
2 Texas.gov
3 smi@EY.com
4 UTD.edu
5 rapper@corporate.com
..... etc.
I want to use the dictionary to categorize the dataframe and build a column Category, like this:
Sno Text Category column1 column2 ......
1 Deloitte.com Consulting
2 Texas.gov Government
3 smi@EY.com Consulting
4 UTD.edu Education
5 rapper@corporate.com Corporate
..... etc.
How can I use a dictionary with multiple values per key in Python to find a full phrase, or part of a phrase, in the Text column and categorize the row accordingly? Does the same logic work when two matches exist, and what happens in that case?
This might sound vague, but I am using a dictionary because it lets me map multiple values to one category. Is there a better way to do this without a dictionary?
Answer 1:
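If I understand correctly, you can invert your dict (called d here) so that each keyword maps to its category, use str.findall to pull the first matching keyword out of each row, then map that keyword back to its category: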
newdict = {i: k for k, v in d.items() for i in v}
df.Text.str.findall('|'.join(newdict.keys())).str[0].map(newdict)
Out[431]:
0 Consulting
1 Government
2 Consulting
3 Education
4 Corporate
Name: Text, dtype: object
df['cate']=df.Text.str.findall('|'.join(newdict.keys())).str[0].map(newdict)
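The approach above can be sketched as a self-contained script. This is a minimal sketch rather than the answerer's exact code: the sample rows 'KPMG@University.edu' and 'nothing.here' and the AllCategories column are hypothetical additions for illustration, and re.escape is added on the assumption that the '.' in '.edu' and '.gov' should match a literal dot. It also shows what happens with two matches: .str[0] keeps only the leftmost keyword found, while collecting every match keeps all categories.

```python
import re
import pandas as pd

d = {
    'Consulting': {'Deloitte', 'EY', 'KPMG', 'PwC'},
    'Education': {'.edu', 'College', 'University'},
    'Government': {'state', '.gov', 'city'},
    'Corporate': {'corpor', 'consumer', 'care'},
}

# The last two rows are hypothetical: one with two categories, one with none.
df = pd.DataFrame({'Text': [
    'Deloitte.com', 'Texas.gov', 'smi@EY.com', 'UTD.edu',
    'rapper@corporate.com', 'KPMG@University.edu', 'nothing.here',
]})

# Invert the dict: keyword -> category.
newdict = {kw: cat for cat, kws in d.items() for kw in kws}

# Escape each keyword so regex metacharacters such as '.' match literally.
pattern = '|'.join(re.escape(kw) for kw in newdict)

# Every keyword found in each row, in left-to-right order.
matches = df['Text'].str.findall(pattern)

# First match only; rows without any match end up as NaN.
df['Category'] = matches.str[0].map(newdict)

# All distinct categories per row (hypothetical extra column).
df['AllCategories'] = matches.apply(
    lambda kws: sorted({newdict[kw] for kw in kws})
)

print(df)
```

Matching is case-sensitive here, so 'ey' would not be recognized as 'EY'; passing flags=re.IGNORECASE to findall and lower-casing the lookup keys is one way to relax that, if it suits the data.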
Answer 2:
This can also be done using re and np.vectorize:
cat = re.compile('|'.join(f"(?P<{k}>{'|'.join(v)})" for k,v in categories.items()))
df['category'] = np.vectorize(lambda x: cat.search(x).lastgroup)(df.text)
This gave me:
text category
0 Deloitte.com Consulting
1 Texas.gov Government
2 smi@EY.com Consulting
3 UTD.edu Education
4 rapper@corporate.com Corporate
Basically, I build a regex in which each category key in the dict becomes a named group and that key's values become the group's pattern, joined by | (meaning "or"). np.vectorize then applies this regex search to each item and returns the name of the group that matched.
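The named-group idea can be sketched as a runnable script. This is a sketch under a few assumptions, not the answerer's exact code: Series.map with a small helper (the hypothetical categorize function) is swapped in for np.vectorize so that rows with no match yield None instead of raising AttributeError, re.escape is added so '.' matches literally, and the row 'nothing.here' is a made-up example. Note that category names must be valid Python identifiers to serve as regex group names.

```python
import re
import pandas as pd

categories = {
    'Consulting': {'Deloitte', 'EY', 'KPMG', 'PwC'},
    'Education': {'.edu', 'College', 'University'},
    'Government': {'state', '.gov', 'city'},
    'Corporate': {'corpor', 'consumer', 'care'},
}

# The last row is hypothetical and matches no category.
df = pd.DataFrame({'text': [
    'Deloitte.com', 'Texas.gov', 'smi@EY.com', 'UTD.edu',
    'rapper@corporate.com', 'nothing.here',
]})

# One named group per category; its alternatives are the escaped keywords.
groups = []
for name, keywords in categories.items():
    alternatives = '|'.join(re.escape(kw) for kw in keywords)
    groups.append(f'(?P<{name}>{alternatives})')
cat = re.compile('|'.join(groups))


def categorize(text):
    """Return the name of the group that matched, or None if nothing did."""
    m = cat.search(text)
    return m.lastgroup if m else None


df['category'] = df['text'].map(categorize)
print(df)
```

When a row contains keywords from two categories, search returns only the leftmost match, so that row gets the category of whichever keyword appears first in the text.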
Source: https://stackoverflow.com/questions/55190428/categorize-a-column-using-a-dictionary-key-multiple-values-pair