String similarity score/hash

长发绾君心 2020-12-07 09:52

Is there a method to calculate something like a general "similarity score" of a string? That is, I am not comparing two strings with each other, but rather I get some number (h

12 answers
  •  失恋的感觉
    2020-12-07 10:47

    In natural language processing there is a measure called minimum edit distance (MED), also known as Levenshtein distance.
    It is defined as the smallest total cost of the operations needed to transform string1 into string2.
    The operations are insertion, deletion, and substitution; each has a cost, and the costs of the operations used add up to the distance.
    To apply this to your problem, compute the MED from a chosen base string to every other string, sort the collection by distance, and pick the n strings with the smallest distances.
    For example:

    {"Hello World", "Hello World!", "Hello Earth"}
    Choosing base-string="Hello World"  
    Med(base-string, "Hello World!") = 1  
    Med(base-string, "Hello Earth") = 8  
    1st closest string is "Hello World!"
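
    The 8 for "Hello Earth" may look high next to the unit-cost Levenshtein value of 4. It follows from the cost scheme used in this answer, where a substitution costs 2, the same as one deletion plus one insertion. Under that scheme the distance can also be written in terms of the length of the longest common subsequence (LCS), which gives a quick hand check. The notation below is introduced here for illustration and is not part of the original answer:

    ```latex
    \begin{aligned}
    d(s_1, s_2) &= |s_1| + |s_2| - 2\,\mathrm{LCS}(s_1, s_2) \\
    d(\text{Hello World}, \text{Hello Earth}) &= 11 + 11 - 2 \cdot 7 = 8
    \end{aligned}
    ```

    Here the LCS is "Hello r" (7 characters). Equivalently, four mismatched characters (W/E, o/a, l/t, d/h) at a cost of 2 each give 8.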
    

    This effectively assigns a score to each string in your collection, relative to the chosen base string.
    C# implementation (insertion costs 1, deletion costs 1, substitution costs 2):

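    // Requires `using System;` for Math.Min.
    public static int Distance(string s1, string s2)
    {
        // matrix[i, j] = cost of turning the first i chars of s1 into the first j chars of s2.
        int[,] matrix = new int[s1.Length + 1, s2.Length + 1];

        // Base cases: i deletions to reach the empty string, j insertions to build from it.
        for (int i = 0; i <= s1.Length; i++)
            matrix[i, 0] = i;
        for (int j = 0; j <= s2.Length; j++)
            matrix[0, j] = j;

        for (int i = 1; i <= s1.Length; i++)
        {
            for (int j = 1; j <= s2.Length; j++)
            {
                int deletion = matrix[i - 1, j] + 1;
                int insertion = matrix[i, j - 1] + 1;
                // A matching character is free; a substitution costs 2.
                int substitution = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2);

                matrix[i, j] = Math.Min(deletion, Math.Min(insertion, substitution));
            }
        }

        return matrix[s1.Length, s2.Length];
    }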
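
    The "compute, sort, pick the n closest" step described earlier could be sketched roughly as follows. The class name `ClosestStrings` and the method `Closest` are hypothetical names chosen for illustration; the distance function is passed in as a parameter, so any function shaped like `Distance(string, string)` can be plugged in.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class ClosestStrings
    {
        // Hypothetical helper: returns the n candidates with the smallest
        // distance to baseString, closest first.
        public static List<string> Closest(
            string baseString,
            IEnumerable<string> candidates,
            int n,
            Func<string, string, int> distance)
        {
            return candidates
                .Where(c => c != baseString)   // skip the base string itself
                .Select(c => (Text: c, Score: distance(baseString, c)))
                .OrderBy(pair => pair.Score)   // LINQ's OrderBy is stable: ties keep input order
                .Take(n)
                .Select(pair => pair.Text)
                .ToList();
        }
    }
    ```

    With the three strings from the example, a call such as `ClosestStrings.Closest("Hello World", strings, 1, Distance)` should return a list containing only "Hello World!", since its distance of 1 is smaller than the 8 for "Hello Earth".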
    

    Distance runs in O(n × m) time and uses O(n × m) memory for the matrix, where n and m are the lengths of the two strings.
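
    If memory matters for long strings, one common variant keeps only two rows of the matrix at a time, since each cell depends only on the current and previous row. The sketch below is an illustrative alternative rather than the original answer's code; `EditDistance` and `DistanceTwoRows` are hypothetical names, and the costs are assumed to be the same (insertion 1, deletion 1, substitution 2). The time is still O(n × m), but the extra space drops to O(min(n, m)).

    ```csharp
    using System;

    public static class EditDistance
    {
        public static int DistanceTwoRows(string s1, string s2)
        {
            // Keep the shorter string along the row so the buffers stay small.
            // Swapping is safe because insertion and deletion cost the same,
            // which makes the distance symmetric.
            if (s2.Length > s1.Length)
                (s1, s2) = (s2, s1);

            int[] previous = new int[s2.Length + 1];
            int[] current = new int[s2.Length + 1];

            // Row 0: building the first j chars of s2 from nothing takes j insertions.
            for (int j = 0; j <= s2.Length; j++)
                previous[j] = j;

            for (int i = 1; i <= s1.Length; i++)
            {
                // Column 0: reducing the first i chars of s1 to nothing takes i deletions.
                current[0] = i;

                for (int j = 1; j <= s2.Length; j++)
                {
                    int deletion = previous[j] + 1;
                    int insertion = current[j - 1] + 1;
                    int substitution = previous[j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 2);
                    current[j] = Math.Min(deletion, Math.Min(insertion, substitution));
                }

                // Reuse the two buffers instead of allocating a new row.
                (previous, current) = (current, previous);
            }

            // After the final swap, the last computed row is in `previous`.
            return previous[s2.Length];
        }
    }
    ```

    The trade-off is that the full matrix is no longer available afterwards, so the sequence of edits cannot be reconstructed from it; only the distance is returned.
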
    More info on Minimum Edit Distance can be found here
