How to implement category-based text tagging using WordNet or a related resource?

☆樱花仙子☆ submitted on 2019-12-03 14:49:54

You need to categorize a bunch of nouns (e.g. "car", "gear") into predefined categories (e.g. "automobile"). Although named-entity recognition is the proper way of getting this done, it has its issues, the main one being gathering enough annotated data for training the system properly.

WordNet can help by establishing semantic similarity between nouns, which lets you select the category with the highest similarity score. There are several ways of establishing similarity scores. Two prominent ones, both used in the examples below, are the path score and Wu & Palmer's score.

The basic idea is that similar terms are grouped under similar categories by an ontology (such as WordNet). Therefore, the distance between their categories in the category tree of the ontology will be shorter if they are closely related, and longer otherwise. Perhaps the simplest such score is the path-score:

PathScore(s1, s2) = 1/pathLength(s1, s2)

where pathLength is the length of the path in the aforementioned category tree.
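
As a rough sketch of the path score, the snippet below computes it over a tiny hand-made category tree. The tree, the node names and the helper functions are all hypothetical and are not real WordNet data. It also assumes that pathLength counts the nodes on the shortest path (so a term compared with itself has length 1), which keeps the score between 0 and 1.

```python
from collections import deque

# Hypothetical toy category tree (child -> parent); NOT real WordNet data.
PARENT = {
    "vehicle": "artifact",
    "machine": "artifact",
    "automobile": "vehicle",
    "truck": "vehicle",
    "engine": "machine",
    "computer": "machine",
}


def neighbors(node):
    """Return the nodes one edge away: all children plus the parent, if any."""
    result = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        result.append(PARENT[node])
    return result


def path_length(s1, s2):
    """Number of nodes on the shortest path from s1 to s2, both included.

    Assumption: identical terms have length 1. Returns None if no path exists.
    """
    queue = deque([(s1, 1)])
    seen = {s1}
    while queue:
        node, length = queue.popleft()
        if node == s2:
            return length
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, length + 1))
    return None


def path_score(s1, s2):
    """PathScore(s1, s2) = 1 / pathLength(s1, s2), or 0.0 when unconnected."""
    length = path_length(s1, s2)
    return 0.0 if length is None else 1.0 / length


if __name__ == "__main__":
    for other in ("automobile", "truck", "engine"):
        print("automobile", other, round(path_score("automobile", other), 3))
```

In this toy tree, siblings such as automobile and truck score higher than terms that only meet at the root, which is the behaviour the formula is meant to capture.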

To illustrate:

PathScore(*car*, *automobile*) = 1.0;     // path score is always between 0 and 1
WuPalmerScore(*car*, *automobile*) = 1.0; // Wu & Palmer's score is always between 0 and 1

PathScore(*engine*, *automobile*) = 0.25;
WuPalmerScore(*engine*, *automobile*) = 0.88;

PathScore(*microprocessor*, *automobile*) = 0.09;
WuPalmerScore(*microprocessor*, *automobile*) = 0.58;
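
For comparison, here is a minimal sketch of Wu & Palmer's score on a hypothetical toy tree (again not real WordNet data). It assumes the usual formulation, 2 * depth(lcs) / (depth(s1) + depth(s2)), where lcs is the deepest ancestor shared by both terms and the root has depth 1. The numbers it produces will differ from the WordNet figures above because the tree is made up.

```python
# Hypothetical toy category tree (child -> parent); NOT real WordNet data.
PARENT = {
    "vehicle": "artifact",
    "machine": "artifact",
    "automobile": "vehicle",
    "truck": "vehicle",
    "engine": "machine",
    "computer": "machine",
}


def ancestors(node):
    """Chain from the node up to the root, starting with the node itself."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """Depth in the tree, counting the root as depth 1 (an assumption)."""
    return len(ancestors(node))


def lowest_common_subsumer(s1, s2):
    """Deepest node that is an ancestor of (or equal to) both s1 and s2."""
    up1 = set(ancestors(s1))
    for node in ancestors(s2):  # walks upward, so the first hit is the deepest
        if node in up1:
            return node
    return None


def wu_palmer_score(s1, s2):
    """2 * depth(lcs) / (depth(s1) + depth(s2)); 0.0 if nothing is shared."""
    lcs = lowest_common_subsumer(s1, s2)
    if lcs is None:
        return 0.0
    return 2.0 * depth(lcs) / (depth(s1) + depth(s2))


if __name__ == "__main__":
    for other in ("automobile", "truck", "engine"):
        print("automobile", other, round(wu_palmer_score("automobile", other), 3))
```

Because the score depends on how deep the shared ancestor sits, two terms that meet far down the tree score close to 1 even when the path between them is several steps long. That is why Wu & Palmer's scores tend to be higher than path scores for the same pair.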

So, as you can see, terms that you want in the same category will usually have higher similarity scores. A good library for doing this is WordNet Similarity for Java (WS4J), which offers several similarity metrics for you to experiment with. It also has an online demo.
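
Turning scores into tags can be sketched as follows: score a noun against every predefined category and keep the best one, provided it clears a threshold. The toy tree, the category names, the threshold of 0.5 and the function names are all assumptions made for illustration. In practice the similarity function would come from a WordNet library rather than this stand-in.

```python
# Hypothetical toy category tree (child -> parent); NOT real WordNet data.
PARENT = {
    "vehicle": "artifact",
    "electronics": "artifact",
    "automobile": "vehicle",
    "sedan": "automobile",
    "taxi": "automobile",
    "microprocessor": "electronics",
    "radio": "electronics",
}


def ancestors(node):
    """Chain from the node up to the root, starting with the node itself."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def similarity(s1, s2):
    """Stand-in Wu & Palmer style score over the toy tree (root depth = 1)."""
    up1 = ancestors(s1)
    up2 = ancestors(s2)
    shared = [node for node in up2 if node in set(up1)]
    if not shared:
        return 0.0  # e.g. a word that is missing from the tree
    lcs_depth = len(ancestors(shared[0]))
    return 2.0 * lcs_depth / (len(up1) + len(up2))


def categorize(noun, categories, threshold=0.5):
    """Return the best-scoring category, or None if no score reaches threshold."""
    best_category, best_score = None, 0.0
    for category in categories:
        score = similarity(noun, category)
        if score > best_score:
            best_category, best_score = category, score
    return best_category if best_score >= threshold else None


if __name__ == "__main__":
    categories = ["automobile", "electronics"]
    for noun in ("sedan", "microprocessor", "hyundai"):
        print(noun, "->", categorize(noun, categories))
```

The threshold matters: without it, every noun gets forced into whichever category is least dissimilar, even when none of them fits. A suitable value would have to be tuned on your own data.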

Caveat: WordNet will not perform well if you are trying to label proper nouns. For example, if you want Hyundai to be in the automobile category and Samsung in the electronics category, this won't help at all, simply because WordNet does not categorize these nouns. There are other ontologies built on top of WordNet that may help you in this scenario:

  • One such well-known ontology is Yago.
  • Using Wikipedia categories is another successful approach.
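
One way to combine the two approaches, sketched under heavy assumptions: look proper nouns up in a table first and fall back to the WordNet-based scorer for everything else. The table below is hand-written and hypothetical. In a real system it would be derived from a resource such as Yago or Wikipedia categories, and the function names here are made up for illustration.

```python
# Hypothetical lookup table for proper nouns. In practice this mapping would
# be built from an external resource (e.g. Yago or Wikipedia categories).
PROPER_NOUN_CATEGORIES = {
    "hyundai": "automobile",
    "samsung": "electronics",
}


def categorize_term(term, common_noun_categorizer):
    """Try the proper-noun table first, then the WordNet-style categorizer.

    common_noun_categorizer is any callable that takes a term and returns a
    category name or None.
    """
    key = term.lower()
    if key in PROPER_NOUN_CATEGORIES:
        return PROPER_NOUN_CATEGORIES[key]
    return common_noun_categorizer(term)


if __name__ == "__main__":
    # Placeholder categorizer that knows a single common noun.
    def toy_categorizer(term):
        return "automobile" if term == "car" else None

    for term in ("Hyundai", "Samsung", "car", "banana"):
        print(term, "->", categorize_term(term, toy_categorizer))
```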