Finding if two strings are almost similar

小鲜肉 2020-12-29 10:00

I want to find out if two strings are almost similar. For example, a string like 'Mohan Mehta' should match 'Mohan Mehte' and vice versa. Another example, string like 'Um

6 answers
  • 2020-12-29 10:12

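    What you want is a string distance. There are many flavors, but I would recommend starting with the Levenshtein distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into the other.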
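
As a rough illustration of the Levenshtein distance mentioned above, here is a minimal pure-Python sketch. The function name `levenshtein` is chosen for this example and is not taken from any library; it is meant to show the idea, not to replace an optimized implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Mohan Mehta", "Mohan Mehte"))  # 1
```

A distance of 1 on an 11-character name suggests a near match; where to draw the line is a judgment call that depends on your data.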

  • 2020-12-29 10:13

    You can use difflib.SequenceMatcher if you want something from the stdlib:

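    from difflib import SequenceMatcher

    s_1 = 'Mohan Mehta'
    s_2 = 'Mohan Mehte'
    # ratio() returns a similarity score between 0.0 and 1.0
    print(SequenceMatcher(a=s_1, b=s_2).ratio())
    # prints 0.909090909091 on Python 2; Python 3 prints the same value with more digits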
    

    fuzzywuzzy is one of numerous libraries that you can install; it uses the difflib module, optionally with python-Levenshtein. You should also check out the Wikipedia page on approximate string matching.
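
To turn the SequenceMatcher ratio into a yes/no answer, you could compare it against a cutoff. The sketch below is only one way to do it: the helper name `is_similar` and the default threshold of 0.85 are assumptions made for this example, so tune the threshold against your own data.

```python
from difflib import SequenceMatcher


def is_similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True when the SequenceMatcher ratio reaches the threshold."""
    # Lowercase both sides so that case differences do not count as edits.
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


print(is_similar("Mohan Mehta", "Mohan Mehte"))  # True
print(is_similar("Mohan Mehta", "John Smith"))   # False
```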

  • 2020-12-29 10:15

    You might want to look at NLTK (the Natural Language Toolkit), specifically the nltk.metrics package, which implements various string distance algorithms, including the Levenshtein distance mentioned already.

  • 2020-12-29 10:19

    You could split each string into words and check whether the two share at least one first or last name that matches exactly.
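
One possible reading of the split-and-compare idea is sketched below. The helper name `shares_name_part` is made up for this example, and it assumes that name parts are separated by whitespace and that case should be ignored.

```python
def shares_name_part(a: str, b: str) -> bool:
    """Return True if the two names have at least one word in common."""
    parts_a = set(a.lower().split())
    parts_b = set(b.lower().split())
    # isdisjoint() is True when the sets share no element at all
    return not parts_a.isdisjoint(parts_b)


print(shares_name_part("Mohan Mehta", "Mohan Mehte"))  # True ("mohan" matches)
print(shares_name_part("Umesh Gupta", "Umash Gupte"))  # False (no exact word match)
```

The second call shows the limitation: when both the first and last name contain a typo, an exact word comparison finds nothing, so this check works best alongside a distance measure.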

  • 2020-12-29 10:34

    Another approach is to use a "phonetic algorithm":

    A phonetic algorithm is an algorithm for indexing of words by their pronunciation.

    For example, using the Soundex algorithm:

    >>> import soundex
    >>> s = soundex.getInstance()
    >>> s.soundex("Umesh Gupta")
    'U5213'
    >>> s.soundex("Umash Gupte")
    'U5213'
    >>> s.soundex("Umesh Gupta") == s.soundex("Umash Gupte")
    True
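
If you would rather not install a package, the classic Soundex rules are short enough to sketch by hand. The version below is an assumption-laden illustration: it implements the traditional four-character American Soundex and encodes each word separately, so its codes differ from the 'U5213' produced by the library above. The names `soundex` and `sounds_alike` are chosen for this example.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the four-character American Soundex code of a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]  # pad with zeros, then truncate


def sounds_alike(a: str, b: str) -> bool:
    """Compare two names word by word using their Soundex codes."""
    return [soundex(w) for w in a.split()] == [soundex(w) for w in b.split()]


print(soundex("Umesh"), soundex("Gupta"))          # U520 G130
print(sounds_alike("Umesh Gupta", "Umash Gupte"))  # True
```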
    
  • 2020-12-29 10:37

    // Calculate the similarity between two strings (1.0 means identical)
    
      public static double similarity(String s1, String s2) {
        String longer = s1, shorter = s2;
        if (s1.length() < s2.length()) { // longer should always have greater length
          longer = s2; shorter = s1;
        }
        int longerLength = longer.length();
        if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
        /* // If you have StringUtils, you can use it to calculate the edit distance:
        return (longerLength - StringUtils.getLevenshteinDistance(longer, shorter)) /
                                   (double) longerLength; */
        return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
    
      }
    
      // Example implementation of the Levenshtein Edit Distance
      // See http://rosettacode.org/wiki/Levenshtein_distance#Java
      public static int editDistance(String s1, String s2) {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();
    
        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
          int lastValue = i;
          for (int j = 0; j <= s2.length(); j++) {
            if (i == 0)
              costs[j] = j;
            else {
              if (j > 0) {
                int newValue = costs[j - 1];
                if (s1.charAt(i - 1) != s2.charAt(j - 1))
                  newValue = Math.min(Math.min(newValue, lastValue),
                      costs[j]) + 1;
                costs[j - 1] = lastValue;
                lastValue = newValue;
              }
            }
          }
          if (i > 0)
            costs[s2.length()] = lastValue;
        }
        return costs[s2.length()];
      }
    