Using Python NLTK to find similarity between two web pages?

Submitted by 倖福魔咒の on 2019-11-28 21:41:41

Question


I want to find out whether two web pages are similar. Can someone suggest whether Python NLTK with the WordNet similarity functions would be helpful, and how? What is the best similarity function to use in this case?


Answer 1:


The SpotSigs paper mentioned by joyceschan addresses duplicate-content detection and contains plenty of food for thought.

If you are looking for a quick comparison of key terms, NLTK's standard functions might suffice.

With NLTK you can pull synonyms of your terms by looking up the synsets contained in WordNet:

>>> from nltk.corpus import wordnet
>>> wordnet.synsets('donation')
[Synset('contribution.n.02'), Synset('contribution.n.03')]
>>> wordnet.synsets('donations')
[Synset('contribution.n.02'), Synset('contribution.n.03')]

The lookup understands plurals, and each synset name also tells you which part of speech the synonym corresponds to (the n in contribution.n.02 marks a noun).

Synsets are organized in a hierarchy, with more specific terms toward the leaves and more general ones toward the root. The more general terms are called hypernyms.

You can measure similarity by how close the terms are to their common hypernym.

Watch out for different parts of speech: according to the NLTK cookbook, they don't have overlapping paths, so you shouldn't try to measure similarity between them.
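To make the idea of "distance to a common hypernym" concrete, here is a minimal sketch of the Wu-Palmer formula on a tiny hand-built hierarchy. The PARENT table and the helper names (path_to_root, depth, lowest_common_hypernym, wup) are made up for illustration; they are not WordNet's data or NLTK's API. The sketch assumes a single-rooted tree in which each term has one parent and the root has depth 1, which is simpler than the real WordNet graph.

```python
# Hypothetical toy hierarchy: each term maps to its direct hypernym (parent).
# This is NOT WordNet data, just an illustration of the tree shape.
PARENT = {
    "entity": None,
    "possession": "entity",
    "gift": "possession",
    "property": "possession",
    "donation": "gift",
    "bequest": "gift",
}


def path_to_root(term):
    """Return the chain of terms from `term` up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Depth of a term, counting the root as depth 1."""
    return len(path_to_root(term))


def lowest_common_hypernym(a, b):
    """Most specific term that is an ancestor of (or equal to) both a and b."""
    ancestors_of_b = set(path_to_root(b))
    for node in path_to_root(a):  # walks from specific to general
        if node in ancestors_of_b:
            return node
    raise ValueError("terms do not share a root")


def wup(a, b):
    """Wu-Palmer score: 2 * depth(common) / (depth(a) + depth(b))."""
    common = lowest_common_hypernym(a, b)
    return 2 * depth(common) / (depth(a) + depth(b))


print(wup("donation", "bequest"))   # 0.75, the siblings share "gift"
print(wup("donation", "property"))  # lower, they only share "possession"
```

The deeper the shared hypernym sits relative to the two terms, the closer the score gets to 1.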

Say you have two terms, donation and gift. You can get them from synsets, but in this example I initialized them directly:

>>> d = wordnet.synset('donation.n.01')
>>> g = wordnet.synset('gift.n.01')

The cookbook recommends the Wu-Palmer Similarity method:

>>> d.wup_similarity(g)
0.93333333333333335

This approach gives you a quick way to determine if the terms used correspond to related concepts. Take a look at Natural Language Processing with Python to see what else you can do to help your analysis of text.
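One possible way to turn term-level scores into a page-level score is to match each key term on one page with its best counterpart on the other page and average the results. The sketch below is only an assumption about how that could be done, not something the cookbook prescribes; page_similarity, best_match_average, and exact_match are hypothetical names, and the scoring function is passed in as a parameter so the code runs without WordNet installed.

```python
def best_match_average(terms_a, terms_b, term_sim):
    """Average, over terms_a, of each term's best score against terms_b."""
    terms_a, terms_b = list(terms_a), list(terms_b)
    if not terms_a or not terms_b:
        return 0.0
    total = sum(max(term_sim(a, b) for b in terms_b) for a in terms_a)
    return total / len(terms_a)


def page_similarity(terms_a, terms_b, term_sim):
    """Symmetric page score: mean of the best-match average in both directions."""
    forward = best_match_average(terms_a, terms_b, term_sim)
    backward = best_match_average(terms_b, terms_a, term_sim)
    return 0.5 * (forward + backward)


def exact_match(a, b):
    """Stand-in scorer for the demo: 1.0 for identical terms, else 0.0."""
    return 1.0 if a == b else 0.0


page_one = ["gift", "charity"]
page_two = ["gift", "tax"]
print(page_similarity(page_one, page_two, exact_match))  # 0.5
```

In practice the term_sim argument could be a small wrapper around wup_similarity on synsets. Such a wrapper would need to treat a None result (no connecting path) as 0 and to compare only synsets of the same part of speech.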




Answer 2:


Consider implementing SpotSigs.



Source: https://stackoverflow.com/questions/6252236/using-python-nltk-to-find-similarity-between-two-web-pages
