Is there a method to calculate something like general "similarity score" of a string? In a way that I am not comparing two strings together but rather I get some number (h
In Natural Language Processing there is a measure called Minimum Edit Distance (MED), also known as Levenshtein distance.
It is defined as the smallest number of operations needed to transform string1 into string2.
The operations are insertion, deletion, and substitution; each operation has a cost that is added to the distance.
To solve your problem, calculate the MED from your chosen string to every other string, sort the collection by distance, and pick the n strings with the smallest distances.
For example:
{"Hello World", "Hello World!", "Hello Earth"}
Choosing base-string="Hello World"
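Med(base-string, "Hello World!") = 1 (one insertion: "!")
Med(base-string, "Hello Earth") = 8 (with substitution costing 2, four of the five letters in "World" are replaced: 4 x 2 = 8)
The 1st closest string is "Hello World!"
This effectively gives a score to each string in your collection.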
C# implementation (insertion = 1, deletion = 1, substitution = 2; note that classic Levenshtein distance charges 1 for a substitution):
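// Requires "using System;" for Math.Min. Place this method inside any class.
public static int Distance(string s1, string s2)
{
    // matrix[i, j] = cost of turning the first i chars of s1 into the first j chars of s2
    int[,] matrix = new int[s1.Length + 1, s2.Length + 1];

    // Base cases: turning a prefix into the empty string (i deletions)
    // and the empty string into a prefix (j insertions).
    for (int i = 0; i <= s1.Length; i++)
        matrix[i, 0] = i;
    for (int j = 0; j <= s2.Length; j++)
        matrix[0, j] = j;

    for (int i = 1; i <= s1.Length; i++)
    {
        for (int j = 1; j <= s2.Length; j++)
        {
            int deletion = matrix[i - 1, j] + 1;
            int insertion = matrix[i, j - 1] + 1;
            // Matching characters cost nothing; a substitution costs 2.
            int substitution = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2);
            matrix[i, j] = Math.Min(deletion, Math.Min(insertion, substitution));
        }
    }
    return matrix[s1.Length, s2.Length];
}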
Complexity: O(n x m) time and space, where n and m are the lengths of the two strings.
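To tie the ranking idea back to the example, here is a minimal sketch that sorts the sample collection by distance from the base string. The class name `ClosestStrings` and the helper `Closest` are hypothetical names chosen for illustration; `Distance` is repeated only so the snippet compiles on its own, with the same costs as above (insertion 1, deletion 1, substitution 2).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClosestStrings
{
    // Same cost scheme as the answer: insert 1, delete 1, substitute 2.
    public static int Distance(string s1, string s2)
    {
        int[,] matrix = new int[s1.Length + 1, s2.Length + 1];
        for (int i = 0; i <= s1.Length; i++) matrix[i, 0] = i;
        for (int j = 0; j <= s2.Length; j++) matrix[0, j] = j;

        for (int i = 1; i <= s1.Length; i++)
        {
            for (int j = 1; j <= s2.Length; j++)
            {
                int deletion = matrix[i - 1, j] + 1;
                int insertion = matrix[i, j - 1] + 1;
                int substitution = matrix[i - 1, j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 2);
                matrix[i, j] = Math.Min(deletion, Math.Min(insertion, substitution));
            }
        }
        return matrix[s1.Length, s2.Length];
    }

    // Hypothetical helper: the n candidates closest to baseString, nearest first.
    // The base string itself is skipped so it does not rank as its own match.
    public static List<string> Closest(string baseString, IEnumerable<string> candidates, int n)
    {
        return candidates
            .Where(c => c != baseString)
            .OrderBy(c => Distance(baseString, c))
            .Take(n)
            .ToList();
    }

    public static void Main()
    {
        var strings = new[] { "Hello World", "Hello World!", "Hello Earth" };
        foreach (string s in Closest("Hello World", strings, 2))
            Console.WriteLine($"{s} (distance {Distance("Hello World", s)})");
        // Prints:
        // Hello World! (distance 1)
        // Hello Earth (distance 8)
    }
}
```

Each call to `Distance` is O(n x m), so ranking k candidates this way costs k such computations; for large collections you would probably want to cache or prune, which this sketch does not attempt.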
More info on Minimum Edit Distance can be found here