What is the use of the Brown Corpus in measuring semantic similarity based on WordNet?

耗尽温柔 submitted on 2019-12-05 17:37:45
arturomp

Take a look at the explanation at the NLTK howto for wordnet.

Specifically, the *_ic suffix (as in brown_ic) stands for information content.

synset1.res_similarity(synset2, ic): Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Note that for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created.
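As a rough illustration of how the ic argument is used, here is a minimal sketch. The helper name brown_ic_similarities is made up for this example, and it assumes the wordnet and wordnet_ic data packages have already been downloaded with nltk.download.

```python
def brown_ic_similarities(name1, name2):
    """Return Resnik, Jiang-Conrath and Lin scores for two synset names.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Imported here so that defining the function does not require the data;
    # calling it assumes nltk.download('wordnet') and
    # nltk.download('wordnet_ic') have already been run.
    from nltk.corpus import wordnet as wn, wordnet_ic

    # Information content derived from Brown corpus counts.
    brown_ic = wordnet_ic.ic('ic-brown.dat')
    s1, s2 = wn.synset(name1), wn.synset(name2)
    return {
        'resnik': s1.res_similarity(s2, brown_ic),
        'jcn': s1.jcn_similarity(s2, brown_ic),
        'lin': s1.lin_similarity(s2, brown_ic),
    }
```

Calling brown_ic_similarities('dog.n.01', 'cat.n.01') should return three scores. Loading a different file, for example 'ic-semcor.dat', would generally give different numbers for the same pair, which is the corpus dependence the documentation warns about.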

A bit more info on information content from here:

The conventional way of measuring the IC of word senses is to combine knowledge of their hierarchical structure from an ontology like WordNet with statistics on their actual usage in text as derived from a large corpus.

The brown_ic in your code refers to the information content file ~/nltk_data/corpora/wordnet_ic/ic-brown.dat. For more detail on the format of the ic-brown.dat, check out this thread from the NLTK-user group.

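Overall, the ic-brown.dat file lists WordNet synsets (identified by offset and part of speech) together with frequency counts derived from the Brown corpus. The information content values are computed from those counts, so a concept that occurs more often in the corpus gets a lower information content.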
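The link between corpus frequencies and information content can be sketched with a toy example. The taxonomy and counts below are invented for illustration and are not taken from the Brown corpus. The sketch assumes the usual definition IC(c) = -log p(c), where an occurrence of a concept also counts as an occurrence of each of its ancestors.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "artifact": "entity",
    "car": "artifact",
}

# Hypothetical counts of direct mentions in some corpus.
COUNTS = {"entity": 0, "animal": 2, "dog": 5, "cat": 3, "artifact": 0, "car": 10}


def cumulative_counts(parent, counts):
    """Add each concept's count to itself and to all of its ancestors."""
    totals = dict.fromkeys(parent, 0)
    for concept, n in counts.items():
        node = concept
        while node is not None:
            totals[node] += n
            node = parent[node]
    return totals


def information_content(parent, counts):
    """Return IC(c) = -log(count(c) / count(root)) for every seen concept."""
    totals = cumulative_counts(parent, counts)
    root = next(c for c, p in parent.items() if p is None)
    return {c: -math.log(t / totals[root]) for c, t in totals.items() if t > 0}


ic = information_content(PARENT, COUNTS)
```

In this toy data the root has an information content of zero, and a specific concept such as "dog" scores higher than its more general parent "animal".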

The semantic similarity measures of Jiang and Conrath (JC), Resnik, and Lin all require a corpus in addition to WordNet. These measures combine WordNet with corpus statistics, and they have been shown to correlate better with human judgment than measures that use WordNet alone (Li 2006; Pedersen 2010).
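To make the three measures concrete, here is a minimal sketch on a toy tree. The concepts and IC values are hypothetical, not Brown corpus figures. It uses the commonly cited formulas: Resnik is the IC of the least common subsumer (LCS), Lin is 2 * IC(LCS) / (IC(a) + IC(b)), and Jiang-Conrath similarity is the inverse of IC(a) + IC(b) - 2 * IC(LCS).

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "artifact": "entity",
    "car": "artifact",
}

# Hypothetical information content values (the root carries no information).
IC = {
    "entity": 0.0,
    "animal": math.log(2),
    "dog": math.log(4),
    "cat": math.log(20 / 3),
    "artifact": math.log(2),
    "car": math.log(2),
}


def ancestors(node):
    """Return the node and its ancestors, most specific first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lcs(a, b):
    """Least common subsumer in a tree: the first ancestor of a shared with b.

    WordNet allows multiple parents, so a real implementation has to pick
    the common subsumer with the highest IC instead.
    """
    b_ancestors = set(ancestors(b))
    return next(n for n in ancestors(a) if n in b_ancestors)


def resnik(a, b):
    return IC[lcs(a, b)]


def lin(a, b):
    return 2 * IC[lcs(a, b)] / (IC[a] + IC[b])


def jiang_conrath(a, b):
    distance = IC[a] + IC[b] - 2 * IC[lcs(a, b)]
    return math.inf if distance == 0 else 1 / distance
```

With these toy values, "dog" and "cat" share the subsumer "animal" and so score higher than "dog" and "car", whose only common ancestor is the root. Every number here comes from the IC table, which is why changing the corpus changes the scores.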
