categorise text in column using keywords

Posted by 牧云@^-^@ on 2021-01-29 04:09:47

Question


I have a table column that contains a description of the treatment done to resolve an issue; this text contains keywords.

In another list, I have the categories, along with the different keywords that help to identify each one.

For example:


Category | Keywords
AAAA     | keyword1
AAAA     | keyword2 and keyword3
AAAA     | keyword3 and not keyword4
BBBB     | keyword4
BBBB     | keyword5 and keyword6
BBBB     | keyword7

How can I fill in the category column of my previous table (the one that contains the description), using the keywords found in it?

For example:


Description                     | Category
this free text keyword1 is done | AAAA
free sample2 keyword4 keyword3  | BBBB


The language I'm using is Python.

I found a similar case, but using Excel: https://exceljet.net/formula/categorize-text-with-keywords

Kind regards


Answer 1:


I would start by creating a list of tuples, where the first element of each tuple is the category and the second is a dictionary listing the keywords that must be included in or excluded from the description. For example:

keyword_tuple = [('AAAA', {'in': ['kwrd1'], 'out': []}),
                 ('AAAA', {'in': ['kwrd2', 'kwrd3'], 'out': []}),
                 ('AAAA', {'in': ['kwrd3'], 'out': ['kwrd4']}),
                 ('BBBB', {'in': ['kwrd4'], 'out': []})]
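
If the rules are kept as text in the form shown in the question ("keyword3 and not keyword4"), they could be converted into this structure rather than typed by hand. The following is only a sketch: parse_rule, rule_table and parsed_rules are hypothetical names, and it assumes terms are always joined by " and ", that exclusions are always written as "not <keyword>", and that no keyword itself contains " and ".

```python
def parse_rule(rule_text):
    """Turn 'kw2 and kw3 and not kw4' into {'in': [...], 'out': [...]}."""
    rule = {'in': [], 'out': []}
    for term in rule_text.split(' and '):
        term = term.strip()
        if term.startswith('not '):
            # "not <keyword>" means the keyword must be absent
            rule['out'].append(term[len('not '):].strip())
        elif term:
            rule['in'].append(term)
    return rule

# Rule table copied from the question
rule_table = [
    ('AAAA', 'keyword1'),
    ('AAAA', 'keyword2 and keyword3'),
    ('AAAA', 'keyword3 and not keyword4'),
    ('BBBB', 'keyword4'),
    ('BBBB', 'keyword5 and keyword6'),
    ('BBBB', 'keyword7'),
]

# Same shape as keyword_tuple: a list of (category, {'in': ..., 'out': ...})
parsed_rules = [(category, parse_rule(text)) for category, text in rule_table]
```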

Once keyword_tuple is initialized correctly, you can loop through your list of descriptions to determine which category each one belongs to. Let's store the results in a list of tuples called result_tuple, where the first element is the description and the second is the corresponding category.

result_tuple = []

# description_list is assumed to be your list of description strings
for description in description_list:
    # Collect the category of every rule whose include keywords are all present
    # and whose exclude keywords are all absent. Both conditions are checked on
    # the same rule, so one rule cannot pass on behalf of another rule that
    # happens to share its category name.
    categories = [cat for cat, rule in keyword_tuple
                  if all(kw in description for kw in rule['in'])
                  and all(kw not in description for kw in rule['out'])]

    # If there are multiple categories satisfying the condition, you need to come up with a decision rule.
    # Here the first matching rule in keyword_tuple wins.
    if len(categories) > 0:
        category = categories[0]
    else:
        category = 'NO CATEGORY'

    # list.append takes a single argument, so pass the pair as one tuple
    result_tuple.append((description, category))
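
One thing to watch with the plain `in` test is that it matches substrings, so 'keyword1' would also be found inside 'keyword10'. A possible variant, sketched below, matches whole words only and wraps the loop in a function. The names has_word and categorise, the case-insensitive matching, and the third sample description are assumptions added for illustration; the first two descriptions and the rules come from the question.

```python
import re

def has_word(keyword, text):
    """True if keyword occurs in text as a whole word, ignoring case."""
    pattern = r'(?<!\w)' + re.escape(keyword) + r'(?!\w)'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def categorise(description, rules, default='NO CATEGORY'):
    """Return the category of the first rule that the description satisfies."""
    for category, rule in rules:
        if (all(has_word(kw, description) for kw in rule['in'])
                and not any(has_word(kw, description) for kw in rule['out'])):
            return category
    return default

rules = [
    ('AAAA', {'in': ['keyword1'], 'out': []}),
    ('AAAA', {'in': ['keyword2', 'keyword3'], 'out': []}),
    ('AAAA', {'in': ['keyword3'], 'out': ['keyword4']}),
    ('BBBB', {'in': ['keyword4'], 'out': []}),
]

descriptions = [
    'this free text keyword1 is done',
    'free sample2 keyword4 keyword3',
    'text mentioning keyword10 only',  # hypothetical extra row
]

results = [(d, categorise(d, rules)) for d in descriptions]
```

With these rules the first two descriptions come out as AAAA and BBBB, as in the question's example, while the third falls through to 'NO CATEGORY' because 'keyword10' is not treated as 'keyword1'.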


Source: https://stackoverflow.com/questions/50388822/categorise-text-in-column-using-keywords
