Categorize a column using a Dictionary key - multiple values pair

Posted by 牧云@^-^@ on 2020-01-03 02:53:07

Question


I have a dictionary:

'Consulting': {'Deloitte', 'EY', 'KPMG', 'PwC'},
'Education': {'.edu', 'College', 'University'},
'Government': {'state', '.gov', 'city'},
'Corporate': {'corpor', 'consumer', 'care'},
 ...... etc.

I have a dataframe:

 Sno  Text            column1    column2 ......
  1   Deloitte.com
  2   Texas.gov
  3   smi@EY.com
  4   UTD.edu
  5   rapper@corporate.com

 ..... etc.

I want to use the dictionary to categorize the dataframe and build a column Category, like this:

 Sno  Text                   Category       column1    column2 ......
  1   Deloitte.com           Consulting
  2   Texas.gov              Government
  3   smi@EY.com             Consulting
  4   UTD.edu                Education
  5   rapper@corporate.com   Corporate
 ..... etc.

How can I use a dictionary with multiple values per key in Python to find a full or partial phrase in the Text column and categorize the row? Does the same logic still work when two keywords match, and what happens in that case?

Also, this might sound vague, but I am using a dictionary because it lets me map multiple values to one category. Is there a better way to do this without a dictionary?


Answer 1:


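IIUC, you can invert your dict so that each keyword maps to its category, find the keywords in each row with findall, then map the first match back to its category: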

import re
# invert the dict: keyword -> category
newdict = {i: k for k, v in d.items() for i in v}
# escape the keywords so that '.' in '.gov' / '.edu' is matched literally
df.Text.str.findall('|'.join(map(re.escape, newdict))).str[0].map(newdict)
Out[431]: 
0    Consulting
1    Government
2    Consulting
3     Education
4     Corporate
Name: Text, dtype: object

df['cate']=df.Text.str.findall('|'.join(map(re.escape, newdict))).str[0].map(newdict)
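The question also asks what happens when two keywords match. With findall, `.str[0]` keeps only the first keyword found in the text, and rows without any keyword end up as NaN. The following is a minimal sketch of one way to make that visible, not a definitive implementation: the rows "EY.gov" and "unknown.org" and the column name AllCategories are made up for illustration and are not part of the original data.

```python
import re
import pandas as pd

d = {
    "Consulting": {"Deloitte", "EY", "KPMG", "PwC"},
    "Education": {".edu", "College", "University"},
    "Government": {"state", ".gov", "city"},
    "Corporate": {"corpor", "consumer", "care"},
}

# The last two rows are hypothetical: one hits two categories, one hits none.
df = pd.DataFrame({"Text": [
    "Deloitte.com", "Texas.gov", "smi@EY.com", "UTD.edu",
    "rapper@corporate.com", "EY.gov", "unknown.org",
]})

# Invert the dict: keyword -> category.
newdict = {value: key for key, values in d.items() for value in values}

# Escape each keyword so '.' is literal; try longer keywords first so a
# short keyword cannot shadow a longer one that starts at the same position.
pattern = "|".join(re.escape(k) for k in sorted(newdict, key=len, reverse=True))

matches = df["Text"].str.findall(pattern)

# First keyword found in the text decides the category; no match gives NaN.
df["Category"] = matches.str[0].map(newdict)

# Every distinct category that matched, sorted, as a list per row.
df["AllCategories"] = matches.apply(
    lambda found: sorted({newdict[m] for m in found})
)

print(df)
```

Rows whose AllCategories list has more than one entry are the ambiguous ones, so they can be reviewed or resolved with whatever priority rule fits the data.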



Answer 2:


This can also be done using re and np.vectorize:

import re
import numpy as np
cat = re.compile('|'.join(f"(?P<{k}>{'|'.join(map(re.escape, v))})" for k, v in categories.items()))
df['category'] = np.vectorize(lambda x: cat.search(x).lastgroup)(df.text)

This gave me:

                   text    category
0          Deloitte.com  Consulting
1             Texas.gov  Government
2            smi@EY.com  Consulting
3               UTD.edu   Education
4  rapper@corporate.com   Corporate

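Basically, I build a single regex in which each category key from the dict becomes a named group and that category's values become the group's pattern, joined by | (meaning "or"). np.vectorize then applies the regex search to each item, and lastgroup returns the name of the group that matched, which is the category. Note that cat.search(x) returns None when no keyword is found, so this raises an AttributeError for rows without a match.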
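A variant of the named-group approach that tolerates rows without any keyword could look like the sketch below. It is only an illustration under a few assumptions: the helper name categorize and the row "unknown.org" are hypothetical, and the category names must be valid Python identifiers because they are used as regex group names.

```python
import re
import pandas as pd

categories = {
    "Consulting": {"Deloitte", "EY", "KPMG", "PwC"},
    "Education": {".edu", "College", "University"},
    "Government": {"state", ".gov", "city"},
    "Corporate": {"corpor", "consumer", "care"},
}

# The last row is hypothetical and matches no keyword.
df = pd.DataFrame({"text": [
    "Deloitte.com", "Texas.gov", "smi@EY.com", "UTD.edu",
    "rapper@corporate.com", "unknown.org",
]})

# One named group per category; keywords are escaped so '.' is literal.
cat = re.compile("|".join(
    "(?P<{}>{})".format(name, "|".join(re.escape(v) for v in sorted(values)))
    for name, values in categories.items()
))

def categorize(text):
    """Return the name of the group that matched, or None if nothing matched."""
    match = cat.search(text)
    return match.lastgroup if match else None

df["category"] = df["text"].map(categorize)
print(df)
```

Because search returns the leftmost match, a text containing keywords from two categories is assigned the category of the keyword that appears first in the text.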



Source: https://stackoverflow.com/questions/55190428/categorize-a-column-using-a-dictionary-key-multiple-values-pair
