Javascript text similarity algorithm

一个人想着一个人 提交于 2019-12-17 18:38:43

问题


I'm building a website that should collect various news feeds and would like the texts to be compared for similarity. What i need is some sort of a news text similarity algorithm. I know that php has the similar_text function and am not sure how good it is + i need it for javascript. So if anyone could point me to an example or a plugin or any instruction on how this is possible or at least where to look and start investigating.


回答1:


There's a javascript implementation of the Levenshtein distance metric, which is often used for text comparisons. If you want to compare whole articles or headlines though you might be better off looking at intersections between the sets of words that make up the text (and frequencies of those words) rather than just string similarity measures.




回答2:


The question whether two texts are similar is a philosophical one as long as you don't specify exactly what it should mean. Consider the Strings "house" and "mouse". Seen from a semantic level they are not very similar, but they are very similar regarding their "physical appearance", because only one letter is different (and in this case you could go by Levenshtein distance).

To decide about similarity you need an appropriate text representation. You could – for instance – extract and count all n-grams and compare the two resulting frequency-vectors using a similarity measure as e.g. cosine similarity. Or you could stem the words to their root form after having removed all stopwords, sum up their occurrences and use this as input for a similarity measure.

There are plenty approaches and papers about that topic, e.g. this one about short texts. In any case: The higher the abstraction level where you want to decide if two texts are similar the more difficult it will get. I think your question is a non-trivial one (and hence my answer rather abstract) ... ;-)



来源:https://stackoverflow.com/questions/5042873/javascript-text-similarity-algorithm

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!