Text similarity algorithm

萌比男神i 2020-12-24 15:08

I have two subtitle files. I need a function that tells me whether they contain the same text or similar text.

Sometimes there are comments like "The w

5 answers
  • 2020-12-24 15:50

    There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.

    The choice of algorithm depends on the language, the type of words, whether the words were entered by a human, and more.

    Here you find a helpful implementation of several algorithms within one library.
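
    For reference, a minimal sketch of the Jaro-Winkler similarity in plain Python. The function names `jaro` and `jaro_winkler` are made up for this illustration, and the prefix weight of 0.1 with a prefix cap of 4 characters are the commonly quoted defaults, assumed here rather than taken from any particular library. It is better suited to short strings such as names than to whole files.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order in the two strings.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)
```

    With these definitions, `jaro_winkler("MARTHA", "MARHTA")` comes out at roughly 0.961.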

  • 2020-12-24 15:59

    For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on the term frequency vectors of the two texts.

    You might want to look at several implementations that are described here: Cosine Similarity
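
    A minimal sketch of the idea, using only the standard library. The helper names `term_frequencies` and `cosine_similarity` are invented for this example, and the tokenizer (lowercase, split on `\w+`) is a simplifying assumption; real subtitle text may need timestamps and markup stripped first.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each word occurs (deliberately simple tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)

    # Dot product over the terms of A; a Counter returns 0 for terms missing in B.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    # A text without any words has no direction; treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

    Note that this measure ignores word order: two texts with the same words in a different sequence still score close to 1.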

  • 2020-12-24 16:02

    Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.

    EDIT: The original version of agrep isn't open source, but you can find links to open-source versions at http://en.wikipedia.org/wiki/Agrep

  • 2020-12-24 16:05

    Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

    Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how far apart or how close they are. The result is an integer: the number of single-character edits needed to turn one text into the other.
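
    A minimal sketch of the distance as a two-row dynamic program. The names `levenshtein` and `normalized_similarity` are made up for illustration, and dividing by the longer length to get a 0-to-1 score is one common convention, not part of the algorithm itself. The running time grows with the product of the two lengths, so it can be slow on large files.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Map the distance to a score in [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```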

  • 2020-12-24 16:08

    You're expecting too much here. It looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it until it gives good results for your input.
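
    One possible starting point along these lines, sketched with Python's standard `difflib` module instead of the `diff` command-line tool. The helpers `normalize_lines` and `line_similarity` are hypothetical names, and the normalization (lowercasing, collapsing whitespace, dropping blank lines) is only an assumption about what should count as "the same" for subtitle text.

```python
import difflib


def normalize_lines(text):
    """Lowercase, collapse whitespace and drop blank lines."""
    return [" ".join(line.lower().split())
            for line in text.splitlines()
            if line.strip()]


def line_similarity(text_a, text_b):
    """Share of lines the two texts have in common, in [0, 1]."""
    lines_a = normalize_lines(text_a)
    lines_b = normalize_lines(text_b)
    # autojunk=False turns off the heuristic that ignores very frequent lines.
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return matcher.ratio()
```

    A threshold on the returned ratio would then have to be tuned on real pairs of files to decide what counts as "similar".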
