Learning NER using a category list

狂风中的少年 posted on 2019-12-25 05:19:17

Question


In the template for training CRF++, how can I include a custom dictionary.txt file for listed companies, another for popular European foods, for example, or for just about any category?

I would then provide sample training data for each category, so that the model learns how those specific named entities are used in context for that category.
That way, both the system and I can be sure it has correctly understood how certain named entities are structured in text, whether a tweet or a Pulitzer Prize-winning news article, without having to provide hundreds of megabytes of data.

This would be rather cool. The model would have a definite dictionary of known entities (which does not need to be expanded) and a statistical approach to how those known entities are structured in human text.

PS: Just for clarity, I am not looking for a regex-based NER. Those are only cool if you have lots of entries in the dictionary, lots of rules, and lots of dull time on your hands.


Answer 1:


I think what you are describing is a gazetteer list (your dictionary.txt).

You would have to add a corresponding feature for each word in the training data and then reference that feature in the template file.

For example: your list contains the entity Hershey's, and the training data has the sentence: I like Hershey's chocolates.

So when you arrange the data in CoNLL format (for CRF++), you can add a column with the value 0 or 1, indicating whether the word is present in the dictionary. This column will be 0 for every word except Hershey's. You also have to include this column as a feature in the template file.
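As a rough illustration of that extra column, here is a minimal Python sketch that builds CRF++-style rows (word, dictionary flag, label) for the example sentence. The function names, the second gazetteer entry, and the B-ORG/O tag scheme are assumptions made for the example, not something prescribed by CRF++. It also matches single tokens only; multi-word dictionary entries would need span matching.

```python
def load_gazetteer(lines):
    """Return a set of lowercased dictionary entries, skipping blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def to_crfpp_rows(tokens, labels, gazetteer):
    """Build one row per token: word, in-dictionary flag (0/1), label."""
    rows = []
    for token, label in zip(tokens, labels):
        flag = "1" if token.lower() in gazetteer else "0"
        rows.append(f"{token} {flag} {label}")
    return rows


# In practice the entries would be read from dictionary.txt;
# an in-memory list keeps the sketch self-contained.
gazetteer = load_gazetteer(["Hershey's", "Nestle"])

tokens = ["I", "like", "Hershey's", "chocolates", "."]
labels = ["O", "O", "B-ORG", "O", "O"]  # hypothetical tag scheme

rows = to_crfpp_rows(tokens, labels, gazetteer)
print("\n".join(rows))
```

Each sentence would be written this way, with a blank line between sentences and the label kept in the last column.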

To get a better understanding of the template file and NER training with CRF++, you can watch the videos below and post your questions in the comments :)

1) https://youtu.be/GJHeTvDkIaE

2) https://youtu.be/Ur5umC4BwN4

EDIT: (after viewing the OP's comment)

Sample training data with extra features: https://pastebin.com/fBgu8c67. I've added 3 features. The IsCountry feature value (1 or 0) can be obtained from a gazetteer list of countries. The other 2 features can be computed offline. Note that the headers in the file are for reference only and should not be included in the training data file.

Sample template file for the above data: https://pastebin.com/LPvAGCVL
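For readers who cannot open the link, the following Python sketch generates a small CRF++ template in which every feature column gets unigram features over a window of the previous, current, and next token. It is not the contents of the pastebin file; the function name, the window size, and the two-column layout (word plus dictionary flag) are assumptions for illustration. The `%x[row,col]` macro refers to the token at relative position `row` and column `col`.

```python
def unigram_template(num_feature_columns, window=(-1, 0, 1)):
    """Return CRF++ template text with unigram features for each column."""
    lines = ["# Unigram"]
    idx = 0
    for col in range(num_feature_columns):
        for offset in window:
            # %x[offset,col]: value of column `col` at relative row `offset`
            lines.append(f"U{idx:02d}:%x[{offset},{col}]")
            idx += 1
    lines.append("")
    lines.append("# Bigram")
    lines.append("B")  # adds features over adjacent output labels
    return "\n".join(lines)


# Column 0 is the word, column 1 is the 0/1 dictionary flag.
template = unigram_template(2)
print(template)
```

The label column is not counted in `num_feature_columns`, since the template only refers to input columns.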

Note that the test data should be in the same format as the training data, with the same features and the same number of columns.
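A quick sanity check along these lines can catch a column mismatch before training. This is a minimal sketch; the helper names and the sample rows are made up for the example.

```python
def column_counts(lines):
    """Return the set of column counts found on non-blank lines."""
    return {len(line.split()) for line in lines if line.strip()}


def same_layout(train_lines, test_lines):
    """True if both files use one consistent, identical column count."""
    train = column_counts(train_lines)
    test = column_counts(test_lines)
    return len(train) == 1 and train == test


train = ["I 0 O", "like 0 O", "Hershey's 1 B-ORG", "", "Thanks 0 O"]
test_ok = ["We 0 O", "visited 0 O"]
test_bad = ["We O", "visited O"]  # dictionary flag column is missing

print(same_layout(train, test_ok))
print(same_layout(train, test_bad))
```

Blank lines are skipped because they only mark sentence boundaries.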



Source: https://stackoverflow.com/questions/43560764/learnig-ner-using-category-list
