Text similarity algorithm


萌比男神i

I have two subtitle files. I need a function that tells whether they represent the same text or similar text.

Sometimes there are comments like "The w


5 answers

野性不改

2020-12-24 15:50

There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.

The choice of algorithm depends on the language, the type of words, whether the words were entered by a human, and many other factors.

Here you can find a helpful implementation of several algorithms within one library.
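To give an idea of what the Jaro-Winkler distance mentioned above computes, here is a minimal Python sketch. It is not taken from the library referred to above; the function names `jaro` and `jaro_winkler` are my own, and the scaling factor `p=0.1` and the prefix cap of 4 are the commonly used defaults, assumed here. Note that this measure is usually applied to short strings such as names, so for whole subtitle files it would more likely be used per line or per word.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are equal."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order, counted in pairs.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Because the prefix bonus rewards strings that start the same way, this tends to suit typed names and short phrases better than long passages.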

走了就别回头了

2020-12-24 15:59

For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (no terms in common) and 1 (identical term frequencies), based on the term frequency vectors of the two texts.

You might want to look at several implementations that are described here: Cosine Similarity
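As a rough illustration of the idea, here is a minimal Python sketch of cosine similarity over term frequency vectors. The tokenization (lowercasing and splitting on `\w+`) and the function names `term_frequencies` and `cosine_similarity` are assumptions made for this example, not part of any implementation linked above.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count how often each lowercased word occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term frequency vectors."""
    vec_a = term_frequencies(text_a)
    vec_b = term_frequencies(text_b)

    # Only terms present in both texts contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_a * norm_b)


a = "The weather is nice today"
b = "Today the weather is nice"
print(cosine_similarity(a, b))  # close to 1.0: same words, different order
```

Keep in mind that this ignores word order entirely, so two subtitles with the same words shuffled score as nearly identical. Whether that is acceptable depends on what "similar" should mean for your files.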

粉色の甜心

2020-12-24 16:02

Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.

EDIT: The original version of agrep isn't open source, so you might get links to OSS versions from http://en.wikipedia.org/wiki/Agrep

渐次进展

2020-12-24 16:05

Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how far apart or how near they are. The result is an integer: the number of single-character edits needed to turn one text into the other.
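For reference, a minimal Python sketch of the distance using the standard dynamic programming approach, keeping only two rows in memory. The `similarity` helper that scales the integer result to a value between 0 and 1 is my own addition for illustration, not part of the algorithm's definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes and substitutions
    needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute if different
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, 0.0 for nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

The running time grows with the product of the two lengths, so on whole subtitle files it can be slow. Comparing line by line, or word by word instead of character by character, is one way to keep it manageable.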

小鲜肉

2020-12-24 16:08

You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to provide good results for your input.
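If you want to try the diff-style route without writing everything yourself, Python's standard `difflib` module is one possible starting point. The sketch below compares two texts line by line and returns a ratio between 0 and 1. The function name `line_similarity` is hypothetical, and comparing raw lines is an assumption; for real subtitle files you would probably strip timestamps and comments first.

```python
import difflib


def line_similarity(text_a: str, text_b: str) -> float:
    """Share of lines the two texts have in common, in matching order.

    SequenceMatcher.ratio() returns 2 * M / T, where M is the number of
    matching elements and T is the total number of elements in both inputs.
    """
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    return difflib.SequenceMatcher(None, lines_a, lines_b).ratio()


first = "Hello there\nHow are you?\nGoodbye"
second = "Hello there\nHow do you do?\nGoodbye"
print(line_similarity(first, second))  # 2 of 3 lines match on each side
```

From there you can pick a threshold that works for your data, for example treating anything above some ratio as "the same text", and tune the preprocessing until the results look right for your input.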
