Question
I am using the google-diff-match-patch C# library and want to measure the similarity between two texts. To do this, I wrote the following C# code:
List<DiffMatchPatch.Diff> lDiffs = dmpDiff.diff_main(sTexte1, sTexte2);
int iIndex = dmpDiff.diff_levenshtein(lDiffs);
double dsimilarity = 100 - ((double)iIndex / Math.Max(sTexte1.Length, sTexte2.Length) * 100);
This gives similarity values between 0 and 100 (100 == perfect match, 0 == totally different).
Do you think this is a good approach, and is this calculation correct?
Answer 1:
I've had a look at diff_levenshtein on the API home page, and it gives this description:
Given a diff, measure its Levenshtein distance in terms of the number of inserted, deleted or substituted characters. The minimum distance is 0 which means equality, the maximum distance is the length of the longer string.
In the following line, all you are doing is turning the distance (the change measurement) into a percentage of the longer string's length, and then subtracting it from one hundred.
double dsimilarity = 100 - ((double)iIndex / Math.Max(sTexte1.Length, sTexte2.Length) * 100);
So, yes, this looks fine to me.
My only comment would be that the original algorithm uses 0 to represent a perfect match and you are using 100, which might be confusing. If you are OK with this, make sure you comment it appropriately for future maintainers.
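As a rough illustration of that last point, the calculation could be wrapped in a small documented helper. This is only a sketch: the names TextSimilarity and Percent are made up for this example and are not part of the library. It takes the distance returned by diff_levenshtein as a plain int, and it also guards the case where both texts are empty, where the original expression would divide 0.0 by 0 and produce NaN.

```csharp
using System;

public static class TextSimilarity
{
    /// <summary>
    /// Converts a Levenshtein distance into a similarity percentage.
    /// 100 means the texts are identical, 0 means completely different.
    /// (The raw distance runs the other way: 0 means identical.)
    /// Hypothetical helper, not part of diff-match-patch.
    /// </summary>
    public static double Percent(int distance, int length1, int length2)
    {
        int longest = Math.Max(length1, length2);

        // Two empty texts are identical; without this guard the division
        // below would be 0.0 / 0 and yield NaN.
        if (longest == 0)
        {
            return 100.0;
        }

        // Distance as a share of the longer text, subtracted from 100.
        return 100.0 - (double)distance / longest * 100.0;
    }
}
```

With the variables from the question, the call would look like TextSimilarity.Percent(dmpDiff.diff_levenshtein(lDiffs), sTexte1.Length, sTexte2.Length). For instance, a distance of 3 between texts of length 6 and 7 gives 100 - 3 / 7 * 100, roughly 57.1.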
Source: https://stackoverflow.com/questions/19089666/how-to-calculate-similarity-with-google-diff-match-patch-c-sharp-library