NER model to recognize Indian names

Submitted by 独自空忆成欢 on 2019-12-01 01:16:28

I ended up doing the following to create an NER model that identifies Indian names. This may be useful for anyone who wants to build a custom NER model for non-English person names: most publicly available NER models, such as the ones from Stanford NLP, were trained on English names and are therefore more accurate at identifying English (British/American) names.

  1. Find an Indian celebrity who has a Twitter account with a huge number of followers (in my case, I chose Sachin Tendulkar).
  2. Write a program in the language of your choice that calls the Twitter REST API (GET followers/list) to fetch the names of all the celebrity's followers and save them to a file. We can safely assume that most of the followers are Indian. Note that there is an API rate limit in place (30 requests per 15-minute window), so the program must be built to handle it. In our case, we developed the program as a Windows Service that runs every 15 minutes.
  3. Since some Twitter users' names may not be valid person names, it is advisable to add some rule-based logic (such as a RegEx) that keeps only the names that look real and writes only those to the file.
  4. Once the file of real names is generated, write another program that creates the training data file, containing these names labelled/annotated as PERSON as well as non-entity words annotated as OTHER. If you are using the Stanford NER CRF Classifier, the program should generate a training (TSV) file with two columns: the first containing the word (token) and the second containing the label.
  5. Once the training corpus has been generated programmatically, you can follow the link below to create your custom NER model to recognize Indian names: http://nlp.stanford.edu/software/crf-faq.shtml#a
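Steps 2 to 4 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the program we actually ran: `fetch_page` is a hypothetical callable standing in for whatever makes the real GET followers/list request, the cursor convention (start at -1, stop at 0) is assumed, the name pattern is one arbitrary example of a rule, and the blank line after each name is a choice of this sketch.

```python
import re
import time
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical signature: takes a cursor and returns (display_names, next_cursor).
# The real HTTP call to GET followers/list would live inside this callable.
FetchPage = Callable[[int], Tuple[List[str], int]]


def fetch_all_names(fetch_page: FetchPage,
                    requests_per_window: int = 30,
                    window_seconds: float = 15 * 60,
                    sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield follower names page by page, pausing once a window is used up."""
    cursor = -1  # assumed "first page" cursor; 0 is assumed to mean "no more pages"
    used = 0
    while cursor != 0:
        if used == requests_per_window:
            sleep(window_seconds)  # wait for the next rate-limit window
            used = 0
        names, cursor = fetch_page(cursor)
        used += 1
        yield from names


# Assumed rule: two to four alphabetic words, each starting with a capital letter.
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+){1,3}$")


def looks_like_person_name(display_name: str) -> bool:
    """Return True if the display name passes the (assumed) name pattern."""
    return bool(NAME_PATTERN.match(display_name.strip()))


def to_tsv_lines(names: Iterable[str], other_tokens: Iterable[str]) -> List[str]:
    """Build two-column lines: token, a tab, then the label PERSON or OTHER."""
    lines: List[str] = []
    for name in names:
        for token in name.split():
            lines.append(f"{token}\tPERSON")
        lines.append("")  # blank line after each name (a choice of this sketch)
    for token in other_tokens:
        lines.append(f"{token}\tOTHER")
    return lines
```

Passing `sleep` as a parameter keeps the pacing logic testable without actually waiting 15 minutes. The resulting lines can be written to the training file with `"\n".join(...)`.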

This website has already done this for us! It addresses the following problems.

Challenges in Indian Language NER

Indian languages belong to several language families, the major ones being Indo-Aryan (a branch of Indo-European) and Dravidian. The challenges in NER arise from several factors. Some of the main ones are listed below:

  1. Morphologically rich: identifying the root is difficult and requires the use of morphological analysers.
  2. No capitalization: in English, capitalization is one of the main features, whereas Indian languages do not have it.
  3. Ambiguity between common and proper nouns. E.g. a common word such as "Roja", meaning rose flower, is also a person's name.
  4. Spelling variations: in web data, different people spell the same entity differently. For example, the Tamil person name Roja is spelt as "rosa" or "roja".

The whole corpus is provided.

Named Entity Recognition for Indian Languages and English

Good luck getting the passwords for the zip files!

cheers!

A suggestion: you could try to exploit the Indian version of Wikipedia for training, or to create a gazetteer automatically.

I don't know whether it is the most efficient or quickest solution, but a lot of research exploits Wikipedia and its semi-structured content (for example, each page is annotated with several categories).

You can have a look at these articles to find an idea that suits you: https://scholar.google.fr/scholar?q=named+entity+recognition+using+wikipedia&btnG=&hl=fr&as_sdt=0%2C5
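As a rough illustration of the gazetteer idea, here is a minimal sketch. It assumes you have already obtained (title, categories) pairs for Wikipedia pages by some means that is not shown; the function names, the keyword matching rule, and the sample categories in the usage note are all hypothetical, and the token lookup is deliberately naive.

```python
from typing import Iterable, List, Set, Tuple


def build_gazetteer(pages: Iterable[Tuple[str, List[str]]],
                    keywords: Iterable[str]) -> Set[str]:
    """Collect titles of pages that have a category containing any keyword."""
    lowered = [k.lower() for k in keywords]
    gazetteer: Set[str] = set()
    for title, categories in pages:
        if any(k in c.lower() for c in categories for k in lowered):
            gazetteer.add(title)
    return gazetteer


def tag_with_gazetteer(tokens: Iterable[str],
                       gazetteer: Set[str]) -> List[Tuple[str, str]]:
    """Label a token PERSON if it is part of any gazetteer name, else OTHER."""
    name_tokens = {part for name in gazetteer for part in name.split()}
    return [(t, "PERSON" if t in name_tokens else "OTHER") for t in tokens]
```

For example, with the made-up input `[("Sachin Tendulkar", ["Indian cricketers"]), ("Mumbai", ["Cities in Maharashtra"])]` and the keyword `"cricketers"`, only the first title ends up in the gazetteer. The ambiguity problem mentioned earlier (a common word that is also a name) is not handled by a plain lookup like this.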
