check if two words are related to each other

前端未结

关注

 3  1045

I have two lists: one, the interests of the user; and second, the keywords about a book. I want to recommend the book to the user based on his given interests list. I am usi

相关标签:

3条回答

生来不讨喜

2020-12-18 10:07

Thats because SequenceMatcher is based on edit distance or something alike. semantic similarity is more suitable for your case or hybrid of the two.

take a look NLTK pack (code example) as you are using python and maybe this paper

for people using c++ can check this open source project for reference

0 讨论(0)
发布评论:

提交评论
- 加载中...
醉话见心

2020-12-18 10:14
Firstly a word can have many senses and when you try to find similar words you might need some word sense disambiguation http://en.wikipedia.org/wiki/Word-sense_disambiguation.

Given a pair of words, if we take the most similar pair of senses as the gauge of whether two words are similar, we can try this:
```
from nltk.corpus import wordnet as wn
from itertools import product

wordx, wordy = "cat","dog"
sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)

maxscore = 0
for i,j in list(product(*[sem1,sem2])):
  score = i.wup_similarity(j) # Wu-Palmer Similarity
  maxscore = score if maxscore < score else maxscore
```
There are other similarity functions that you can use. http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html. The only problem is when you encounter words not in wordnet. Then i suggest you fallback on difflib.
0 讨论(0)
发布评论:

提交评论
- 加载中...
一整个雨季

2020-12-18 10:25
At first, I thought to regular expressions to perform additional tests to discriminate the matchings with low ratio. It can be a solution to treat specific problem like the one happening with words ending with ing. But that's only a limited case and thre can be numerous other cases that would need to add specific treatment for each one.

Then I thought that we could try to find additional criterium to eliminate not semantically matching words having a letters simlarity ratio enough to be detected as matcging together though the ratio is low,
WHILE in the same time catching real semantically matching terms having low ratio because they are short.

Here's a possibility
```
from difflib import SequenceMatcher

interests = ('shooting','gaming','looping')
keywords = ('loop','looping','game')

s = SequenceMatcher(None)

limit = 0.50

for interest in interests:
    s.set_seq2(interest)
    for keyword in keywords:
        s.set_seq1(keyword)
        b = s.ratio()>=limit and len(s.get_matching_blocks())==2
        print '%10s %-10s  %f  %s' % (interest, keyword,
                                      s.ratio(),
                                      '** MATCH **' if b else '')
    print
```
gives
```
  shooting loop        0.333333  
  shooting looping     0.666667  
  shooting game        0.166667  

    gaming loop        0.000000  
    gaming looping     0.461538  
    gaming game        0.600000  ** MATCH **

   looping loop        0.727273  ** MATCH **
   looping looping     1.000000  ** MATCH **
   looping game        0.181818  
```
Note this from the doc:

SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.
0 讨论(0)
发布评论:

提交评论
- 加载中...