Algorithm to find articles with similar text


梦谈多话 2020-11-28 18:10

I have many articles in a database (each with a title and text). I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions".

15 answers

遥遥无期 (original poster)

2020-11-28 18:40
The link in @alex77's answer points to the Sorensen-Dice coefficient, which was independently discovered by the author of that article. The article is very well written and well worth reading.

I ended up using this coefficient for my own needs. However, the original coefficient can yield misleading results when dealing with
- three-letter word pairs in which one word contains a misspelling, e.g. [and,amd], and
- three-letter word pairs that are anagrams, e.g. [and,dan].

In the first case Dice reports a coefficient of zero, which is wrong for such a near match, whilst in the second case the coefficient comes out as 0.5, which is misleadingly high.
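
To make the two failure cases concrete, here is a minimal sketch of the uncorrected bigram coefficient for single words. The names `plainBigrams` and `plainDice` are hypothetical and used only for illustration; the counting rule (bigrams of the first word that also occur in the second) is an assumption chosen to reproduce the figures quoted above.

```javascript
// Hypothetical helper: adjacent-character bigrams of one word, no correction.
function plainBigrams(word) {
  var out = [];
  for (var i = 0; i < word.length - 1; i++) out.push(word.slice(i, i + 2));
  return out;
}

// Hypothetical helper: uncorrected Dice coefficient for two single words.
function plainDice(a, b) {
  var x = plainBigrams(a.toLowerCase()),
      y = plainBigrams(b.toLowerCase());
  // Count the bigrams of x that also occur in y.
  var common = x.filter(function (e) { return y.indexOf(e) > -1; }).length;
  return 2 * common / (x.length + y.length);
}

console.log(plainDice('and', 'amd')); // 0   (no shared bigram)
console.log(plainDice('and', 'dan')); // 0.5 (only "an" is shared)
```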

An improvement has been suggested which, in essence, consists of taking the first and last characters of the word and adding them as an additional bigram.

In my view the improvement is only really required for three-letter words; in longer words the other bigrams have a buffering effect that covers up the problem. My code implementing this improvement is given below.
```
// Bigrams of a single word. Three-letter words get one extra bigram
// made of the first and last character.
function wordPairCount(word)
{
 var i, rslt = [], len = word.length - 1;
 for (i = 0; i < len; i++) rslt.push(word.slice(i, i + 2));
 if (2 == len) rslt.push(word[0] + word[len]);
 return rslt;
}

// Bigrams of every space-separated word in a phrase, ignoring case.
function pairCount(phrase)
{
 var i, rslt = [], words = phrase.toLowerCase().split(' ');
 for (i = 0; i < words.length; i++) rslt = rslt.concat(wordPairCount(words[i]));
 return rslt;
}

// Number of bigrams in the longer list that also occur in the shorter one.
function commonCount(a,b)
{
 var t;
 if (b.length > a.length) t = b, b = a, a = t;
 t = a.filter(function (e){return b.indexOf(e) > -1;});
 return t.length;
}

// Dice coefficient: twice the shared bigrams over the total bigram count.
function myDice(a,b)
{
 var aPairs = pairCount(a),
 bPairs = pairCount(b),
 isct = commonCount(aPairs,bPairs);
 return 2*isct/(aPairs.length + bPairs.length);
}

// Requires jQuery and elements with ids rslt1..rslt4 on the page.
$('#rslt1').text(myDice('WEB Applications','PHP Web Application')); // ≈ 0.93
$('#rslt2').text(myDice('And','Dan')); // ≈ 0.33
$('#rslt3').text(myDice('and','aMd')); // ≈ 0.33
$('#rslt4').text(myDice('abracadabra','abracabadra')); // 0.9
```
```
*{font-family:arial;}
table
{
 width:80%;
 margin:auto;
 border:1px solid silver;
}

thead > tr > td
{
 font-weight:bold;
 text-align:center;
 background-color:aqua;
}
```
```
Phrase 1          | Phrase 2             | Dice
------------------+----------------------+--------
WEB Applications  | PHP Web Application  | #rslt1
And               | Dan                  | #rslt2
and               | aMd                  | #rslt3
abracadabra       | abracabadra          | #rslt4
```
Note the deliberate misspelling in the last example: abracadabra vs abracabadra. Because the word is longer than three letters, the code applies no extra-bigram correction, and the reported coefficient is still 0.9. With the correction it would have been 0.91.
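
The 0.91 figure can be checked with a small variant that appends the first-plus-last-character bigram to every word longer than two letters, not just three-letter ones. This is only a sketch for single words: the names `bigramsAlwaysCorrected` and `diceAlwaysCorrected` are hypothetical, and skipping two-letter words (where the extra bigram would duplicate the only one) is my assumption.

```javascript
// Hypothetical variant: always add the first+last-character bigram.
function bigramsAlwaysCorrected(word) {
  var out = [];
  for (var i = 0; i < word.length - 1; i++) out.push(word.slice(i, i + 2));
  // Assumption: skip words of two letters or fewer.
  if (word.length > 2) out.push(word[0] + word[word.length - 1]);
  return out;
}

function diceAlwaysCorrected(a, b) {
  var x = bigramsAlwaysCorrected(a.toLowerCase()),
      y = bigramsAlwaysCorrected(b.toLowerCase());
  // Count the bigrams of x that also occur in y.
  var common = x.filter(function (e) { return y.indexOf(e) > -1; }).length;
  return 2 * common / (x.length + y.length);
}

// 10 shared bigrams out of 11 + 11, i.e. 20/22.
console.log(diceAlwaysCorrected('abracadabra', 'abracabadra').toFixed(2)); // "0.91"
```

The gain over 0.9 is small, which is consistent with the point that longer words hardly need the correction.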