String similarity -> Levenshtein distance

Submitted by 巧了我就是萌 on 2019-11-26 22:50:21

Question


I'm using the Levenshtein algorithm to find the similarity between two strings. This is a very important part of the program I'm making, so it needs to be effective. The problem is that the algorithm doesn't find the following examples as similar:

CONAIR
AIRCON

The algorithm gives a distance of 6. So for these 6-letter words (you normalize by the length of the longer word), the difference is 100%, which means the similarity is 0%.

I need a way to measure the similarity between two strings that also handles cases like the one above.

Is there a better algorithm I can use? What would you recommend?

EDIT: I've also looked into the Damerau–Levenshtein algorithm, which adds transpositions. The problem is that these transpositions only cover adjacent characters, not blocks of several characters.
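For concreteness, here is a minimal pure-Python check of the numbers above (a sketch of the standard dynamic-programming Levenshtein, with similarity normalized by the longer string as described):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Normalize by the length of the longer string, as described above.
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("CONAIR", "AIRCON"))  # 6
print(similarity("CONAIR", "AIRCON"))   # 0.0 -- no similarity at all
```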


Answer 1:


I would divide the term into unigrams, bigrams and trigrams, then calculate cosine similarity.
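A minimal sketch of that idea in plain Python (character n-grams of sizes 1–3, cosine similarity over the combined counts; no external libraries assumed):

```python
from collections import Counter
from math import sqrt

def ngrams(s, n):
    # All character n-grams of s, with multiplicity.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(s1, s2, sizes=(1, 2, 3)):
    # Build a combined unigram/bigram/trigram count vector per string,
    # then take the cosine of the angle between the two vectors.
    v1, v2 = Counter(), Counter()
    for n in sizes:
        v1 += ngrams(s1, n)
        v2 += ngrams(s2, n)
    dot = sum(v1[g] * v2[g] for g in v1)
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_similarity("conair", "aircon"))  # ~0.8, despite edit distance 6
```

The shared CON and AIR n-grams keep the score high even though the letter order differs completely.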




Answer 2:


I think this can be easily solved by employing the Longest Common Substring/Subsequence algorithm on one of the strings (e.g. "conair") and the other string appended to itself once (e.g. "aircon" -> "airconaircon").

Sample code in C:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Returns the length of the longest common subsequence (LCS)
// between two given strings.
//
// This recursive implementation can be replaced by a
// more performant dynamic programming implementation.
size_t llcs(const char* s1, const char* s2)
{
  size_t len[3];

  if (*s1 == '\0' || *s2 == '\0') return 0;

  len[0] = (*s1 == *s2) + llcs(s1 + 1, s2 + 1);
  len[1] = llcs(s1 + 1, s2);
  len[2] = llcs(s1, s2 + 1);

  if (len[0] < len[1]) len[0] = len[1];
  if (len[0] < len[2]) len[0] = len[2];

  return len[0];
}

// Returns similarity of two given strings in the range
// from 0.0 to 1.0 (1.0 for equal strings).
double similarity(const char* s1, const char* s2)
{
  size_t s1len = strlen(s1);
  size_t s2len = strlen(s2);
  double sim;

  if (s1len == 0 && s2len == 0)
  {
    // Two empty strings are equal
    sim = 1;
  }
  else
  {
    size_t len;
    // Append s1 to itself in s1s1 (e.g. "aircon" -> "airconaircon")
    char* s1s1 = malloc(s1len * 2 + 1);
    strcpy(s1s1, s1);
    strcpy(s1s1 + s1len, s1);

    // Find the length of the LCS between s1s1 and s2
    // (e.g. between "airconaircon" and "conair")
    len = llcs(s1s1, s2);
    // We need it not longer than s1 (e.g. "aircon")
    // since we're actually comparing s1 and s2
    if (len > s1len) len = s1len;

    len *= 2;

    // Prevent 100% similarity between a string and its
    // cyclically shifted version (e.g. "aircon" and "conair")
    if (len == s1len + s2len && strcmp(s1, s2) != 0) len--;

    // Get the final measure of the similarity
    sim = (double)len / (s1len + s2len);

    free(s1s1);
  }

  return sim;
}

int main(int argc, char** argv)
{
  if (argc == 3)
    printf("Similarity of \"%s\" and \"%s\" is %.2f%%\n",
           argv[1], argv[2], 100 * similarity(argv[1], argv[2]));
  else
    printf("Usage:\n  %s string1 string2\n",
           argv[0]);
  return 0;
}

Sample output:

Similarity of "123" and "123" is 100.00%
Similarity of "123" and "1234" is 85.71%
Similarity of "0123" and "123" is 85.71%
Similarity of "a" and "aa" is 66.67%
Similarity of "aa" and "a" is 66.67%
Similarity of "aaaaaaa" and "aaaaaa" is 92.31%
Similarity of "aaaaaa" and "aaaaaaa" is 92.31%
Similarity of "aircon" and "conair" is 91.67%
Similarity of "spit" and "pits" is 87.50%
Similarity of "pits" and "spit" is 87.50%
Similarity of "spits" and "pits" is 88.89%
Similarity of "pits" and "spits" is 88.89%



Answer 3:


It sounds like you may want to try doing Levenshtein distance using syllables or phonemes instead of letters.
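A sketch of how that changes the picture: the same edit-distance DP works over any sequence, so it can run on syllables instead of letters. The syllable split below is hand-made for illustration (a real system would need a hyphenation or phonetic library); combined with the Damerau transposition rule, the two words become a single edit apart. This is the optimal-string-alignment variant of Damerau–Levenshtein:

```python
def damerau(a, b):
    # Optimal-string-alignment variant of Damerau-Levenshtein over
    # arbitrary sequences (letters, syllables, phonemes, ...).
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Transposition of two adjacent units.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

# Hypothetical hand-made syllable split, just for illustration.
print(damerau("conair", "aircon"))              # 6: letters alone don't help
print(damerau(["con", "air"], ["air", "con"]))  # 1: one syllable transposition
```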




Answer 4:


Theoretically, the approach you are using is correct for the problem you are trying to solve. But Levenshtein considers only the individual characters of the two strings.

The string similarity could also be computed with the Longest Common Subsequence method, after which you can run Levenshtein on the remaining unmatched parts.
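One way to read this suggestion (my interpretation, since the answer doesn't spell out the combination): extract the longest common subsequence, strip it from both strings, and compare what is left. For CONAIR/AIRCON the leftovers turn out identical:

```python
def lcs(a, b):
    # Longest common subsequence via dynamic programming,
    # building the subsequence strings directly.
    d = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                d[i + 1][j + 1] = d[i][j] + x
            else:
                d[i + 1][j + 1] = max(d[i][j + 1], d[i + 1][j], key=len)
    return d[-1][-1]

def strip_subsequence(s, sub):
    # Remove the characters of sub from s, matching left to right.
    out, k = [], 0
    for ch in s:
        if k < len(sub) and ch == sub[k]:
            k += 1
        else:
            out.append(ch)
    return "".join(out)

common = lcs("conair", "aircon")
rest1 = strip_subsequence("conair", common)
rest2 = strip_subsequence("aircon", common)
print(common, rest1, rest2)
```

The LCS has length 3 (either CON or AIR), and stripping it leaves the same remainder in both strings, so the Levenshtein distance on the unmatched parts is 0.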

In case you want a clustering-based approach, the following answer seems to have some details, but it is obviously more difficult to implement.




Answer 5:


Sorting the characters of each word and then computing the Levenshtein distance would give a 100% match for your example, but it would also give a 100% match for, e.g.,

CONAIR
RCIAON

which might not be what you want.
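A quick sketch of that failure mode: sorting reduces each word to its multiset of letters, so every anagram collapses to the same key and scores a perfect match.

```python
def sorted_key(s):
    # Sorting the characters makes all anagrams identical,
    # so Levenshtein distance between sorted forms is 0 for any anagram pair.
    return "".join(sorted(s))

print(sorted_key("CONAIR"))  # ACINOR
print(sorted_key("AIRCON"))  # ACINOR
print(sorted_key("RCIAON"))  # ACINOR -- the scramble matches too
```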

The other way to define similarity would be to find the common substrings of the two strings. You can build a suffix tree, find all common substrings, and use them to determine how similar the strings are. For your example, the suffix tree would give the common substrings CON and AIR, which between them cover the whole word (for your two strings), thus concluding that they are similar.
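A suffix tree finds these efficiently; as an illustration only, here is a brute-force sketch that lists the maximal common substrings (at least 2 characters long):

```python
def common_substrings(s1, s2, min_len=2):
    # All substrings of s1 (of at least min_len characters) that also
    # occur in s2. A suffix tree computes this far more efficiently;
    # this brute force is only for illustration.
    found = set()
    for i in range(len(s1)):
        for j in range(i + min_len, len(s1) + 1):
            if s1[i:j] in s2:
                found.add(s1[i:j])
    # Keep only maximal matches (not contained in a longer match).
    return {s for s in found
            if not any(s != t and s in t for t in found)}

print(common_substrings("conair", "aircon"))  # {'con', 'air'}
```

Here `con` and `air` together cover all 6 letters of each word, which is what lets you conclude the two strings are similar.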




Answer 6:


Try other similarity measures such as Sørensen–Dice, Jaccard, and Jaro–Winkler.

Personally, I am a huge fan of Jaro–Winkler, since it has served my purpose many times.

>>> from Levenshtein import jaro_winkler
>>> jaro_winkler("conair", "aircon")
0.8333333333333334



Answer 7:


Take a look at the Needleman–Wunsch or Smith–Waterman algorithms. They are used for string matching via an adapted edit distance for DNA sequences, where insertions, reversals, and transpositions of any length may happen at any place. That said, for sufficiently long strings there is no optimal solution. And don't forget that the edit costs depend on the context in which the algorithm is used (a semantics issue), while any algorithm is always a syntactical machine.
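For a flavor of it, here is a minimal Needleman–Wunsch global-alignment sketch with an arbitrary scoring scheme (+1 match, -1 mismatch, -1 gap; real applications tune these costs, which is exactly the use-context point above):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # Global alignment score via dynamic programming. Real
    # bioinformatics uses substitution matrices and affine gap costs;
    # this simple scheme is just for illustration.
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        score[i][0] = i * gap
    for j in range(1, len(b) + 1):
        score[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1]
                                          else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]

print(needleman_wunsch("CONAIR", "AIRCON"))  # -3
print(needleman_wunsch("CONAIR", "CONAIR"))  # 6: perfect alignment
```

The -3 comes from aligning one shared block (AIR or CON) and paying gap costs for the rest, so the score at least reflects the shared block instead of a flat 0% similarity.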



Source: https://stackoverflow.com/questions/11675034/string-similarity-levenshtein-distance
