Fuzzy text (sentences/titles) matching in C#

春和景丽 2020-12-13 10:11

Hey, I'm using the Levenshtein algorithm to get the distance between the source and target strings.

I also have a method that returns a value from 0 to 1:

/// <         
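
The asker's method is cut off in this copy, so it cannot be reproduced. As a rough illustration only, one common way to turn an edit distance into a 0 to 1 score is to divide by the longer string's length. The sketch below is in Python for brevity; `levenshtein` and `similarity` are hypothetical names, not the asker's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # drop a character of b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(source: str, target: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 when the
    distance equals the length of the longer string."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(source, target) / longest
```

Other normalizations (for example dividing by the sum of both lengths) are equally plausible; which one the asker used is unknown.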


        
5 answers
  • 2020-12-13 10:29

    It sounds like what you want may be a longest common substring match. That is, in your example, two files like

    trash..thash..song_name_mp3.mp3 and garbage..spotch..song_name_mp3.mp3

    would end up looking the same.
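
A minimal sketch of that longest-common-substring idea, in Python for brevity. The function name is hypothetical, and the quadratic dynamic-programming approach is just the textbook one, not something this answer prescribes.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that appears contiguously in both strings."""
    best_len = 0
    best_end = 0  # end index (exclusive) in `a` of the best run found so far
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the common run that ended at the previous characters
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len = current[j]
                    best_end = i
        previous = current
    return a[best_end - best_len:best_end]


shared = longest_common_substring("trash..thash..song_name_mp3.mp3",
                                  "garbage..spotch..song_name_mp3.mp3")
print(shared)  # h..song_name_mp3.mp3
```

Comparing the shared run's length against the length of the shorter name gives a crude similarity ratio.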

    You'd need some heuristics there, of course. One thing you might try is putting the string through a Soundex converter. Soundex is the "codec" used to check whether things "sound" the same (as you might spell them out to a telephone operator). It is a rough phonetic transliteration that tolerates some misspelling and mispronunciation. It is definitely less precise than edit distance, but much, much cheaper. (The official use is for names, and the code keeps only the first letter plus three digits. There's no reason to stop there, though: just apply the mapping to every character in the string. See Wikipedia for details.)

    So my suggestion would be to Soundex your strings, chop each one into a few length tranches (say 5, 10, 20), and then just look at clusters. Within clusters you can use something more expensive, like edit distance or longest common substring.
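
A hedged sketch of that suggestion, in Python for brevity. It is a simplified Soundex-style encoding, not the official algorithm: every letter is mapped, the first letter is not kept, and any unmapped character (including h and w) breaks a run of repeated digits. Reading a "tranche" as a prefix of the encoded key is also an assumption, and both function names are hypothetical.

```python
from collections import defaultdict

# Standard Soundex consonant groups; vowels, h, w and y have no digit.
SOUNDEX_DIGITS = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_DIGITS[letter] = digit


def soundex_key(text: str) -> str:
    """Encode the whole string, collapsing directly repeated digits."""
    digits = []
    last = ""
    for ch in text.lower():
        digit = SOUNDEX_DIGITS.get(ch, "")
        if digit and digit != last:
            digits.append(digit)
        last = digit  # an unmapped character resets the run
    return "".join(digits)


def cluster_by_key_prefix(names, length):
    """Group names whose encoded keys share the same first `length` digits."""
    clusters = defaultdict(list)
    for name in names:
        clusters[soundex_key(name)[:length]].append(name)
    return dict(clusters)


print(cluster_by_key_prefix(["Wild_Horses", "wyld horsez", "Angie"], 5))
```

Running the clustering at several prefix lengths and then applying edit distance only inside each cluster keeps the expensive comparison off most pairs.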

  • 2020-12-13 10:34

    Your problem here may be distinguishing between noise words and useful data:

    • Rolling_Stones.Best_of_2003.Wild_Horses.mp3
    • Super.Quality.Wild_Horses.mp3
    • Tori_Amos.Wild_Horses.mp3

    You may need to produce a dictionary of noise words to ignore. That seems clunky, but I'm not sure there's an algorithm that can distinguish between band/album names and noise.
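
A minimal sketch of such a noise-word filter, in Python for brevity. The word list, the delimiter set, and the decision to drop purely numeric tokens are all assumptions made for illustration; a real list would have to be built from your own collection.

```python
import re

# Hypothetical noise words; extend this from the file names you actually see.
NOISE_WORDS = {"super", "quality", "best", "of", "mp3"}


def significant_tokens(filename, noise=NOISE_WORDS):
    """Split on dots, underscores, dashes and whitespace, then drop noise."""
    tokens = re.split(r"[._\-\s]+", filename.lower())
    return [t for t in tokens
            if t and t not in noise and not t.isdigit()]


for name in ("Rolling_Stones.Best_of_2003.Wild_Horses.mp3",
             "Super.Quality.Wild_Horses.mp3",
             "Tori_Amos.Wild_Horses.mp3"):
    print(significant_tokens(name))
```

The band names still survive the filter, which is exactly the limitation described above: a word list removes known noise but cannot tell an artist from a title.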

  • 2020-12-13 10:37

    This question is kind of old, but this might be useful to future visitors. If you're already using the Levenshtein algorithm and you need to do a little better, I describe some very effective heuristics in this solution:

    Getting the closest string match

    The key is to come up with 3 or 4 (or more) methods of gauging the similarity between your phrases (Levenshtein distance is just one method). Then, using real examples of strings you want to match as similar, you adjust the weightings and combinations of those heuristics until you get something that maximizes the number of positive matches. You then use that formula for all future matches, and you should see great results.

    If a user is involved in the process, it's also best if you provide an interface which allows the user to see additional matches that rank highly in similarity in case they disagree with the first choice.

    Here's an excerpt from the linked answer. If you end up wanting to use any of this code as is, I apologize in advance for having to convert VBA into C#.


    Simple, speedy, and a very useful metric. Using this, I created two separate metrics for evaluating the similarity of two strings. One I call "valuePhrase" and one I call "valueWords". valuePhrase is just the Levenshtein distance between the two phrases, and valueWords splits the string into individual words, based on delimiters such as spaces, dashes, and anything else you'd like, and compares each word to each other word, summing up the shortest Levenshtein distance connecting any two words. Essentially, it measures whether the information in one 'phrase' is really contained in another, just as a word-wise permutation. I spent a few days as a side project coming up with the most efficient way possible of splitting a string based on delimiters.

    valueWords, valuePhrase, and Split function:

    Public Function valuePhrase#(ByRef S1$, ByRef S2$)
        valuePhrase = LevenshteinDistance(S1, S2)
    End Function
    
    Public Function valueWords#(ByRef S1$, ByRef S2$)
        Dim wordsS1$(), wordsS2$()
        wordsS1 = SplitMultiDelims(S1, " _-")
        wordsS2 = SplitMultiDelims(S2, " _-")
        Dim word1%, word2%, thisD#, wordbest#
        Dim wordsTotal#
        For word1 = LBound(wordsS1) To UBound(wordsS1)
            wordbest = Len(S2)
            For word2 = LBound(wordsS2) To UBound(wordsS2)
                thisD = LevenshteinDistance(wordsS1(word1), wordsS2(word2))
                If thisD < wordbest Then wordbest = thisD
                If thisD = 0 Then GoTo foundbest
            Next word2
    foundbest:
            wordsTotal = wordsTotal + wordbest
        Next word1
        valueWords = wordsTotal
    End Function
    
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ' SplitMultiDelims
    ' This function splits Text into an array of substrings, each substring
    ' delimited by any character in DelimChars. Only a single character
    ' may be a delimiter between two substrings, but DelimChars may
    ' contain any number of delimiter characters. It returns a single element
    ' array containing all of text if DelimChars is empty, or a 1 or greater
    ' element array if the Text is successfully split into substrings.
    ' If IgnoreConsecutiveDelimiters is true, empty array elements will not occur.
    ' If Limit greater than 0, the function will only split Text into 'Limit'
    ' array elements or less. The last element will contain the rest of Text.
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    Function SplitMultiDelims(ByRef Text As String, ByRef DelimChars As String, _
            Optional ByVal IgnoreConsecutiveDelimiters As Boolean = False, _
            Optional ByVal Limit As Long = -1) As String()
        Dim ElemStart As Long, N As Long, M As Long, Elements As Long
        Dim lDelims As Long, lText As Long
        Dim Arr() As String
    
        lText = Len(Text)
        lDelims = Len(DelimChars)
        If lDelims = 0 Or lText = 0 Or Limit = 1 Then
            ReDim Arr(0 To 0)
            Arr(0) = Text
            SplitMultiDelims = Arr
            Exit Function
        End If
        ReDim Arr(0 To IIf(Limit = -1, lText - 1, Limit))
    
        Elements = 0: ElemStart = 1
        For N = 1 To lText
            If InStr(DelimChars, Mid(Text, N, 1)) Then
                Arr(Elements) = Mid(Text, ElemStart, N - ElemStart)
                If IgnoreConsecutiveDelimiters Then
                    If Len(Arr(Elements)) > 0 Then Elements = Elements + 1
                Else
                    Elements = Elements + 1
                End If
                ElemStart = N + 1
                If Elements + 1 = Limit Then Exit For
            End If
        Next N
        'Get the last token terminated by the end of the string into the array
        If ElemStart <= lText Then Arr(Elements) = Mid(Text, ElemStart)
        'Since the end of string counts as the terminating delimiter, if the last character
        'was also a delimiter, we treat the two as consecutive, and so ignore the last element
        If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) = 0 Then Elements = Elements - 1
    
        ReDim Preserve Arr(0 To Elements) 'Chop off unused array elements
        SplitMultiDelims = Arr
    End Function
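
For readers who would rather not translate the VBA by hand, here is a rough Python port of `valuePhrase` and `valueWords`. It is a sketch, not the original author's code: the `levenshtein` helper is an assumed stand-in for `LevenshteinDistance`, which the excerpt does not include, and `split_multi_delims` covers only the default behaviour of `SplitMultiDelims` (empty tokens kept, no limit).

```python
import re


def levenshtein(a, b):
    """Assumed stand-in for LevenshteinDistance (plain edit distance)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def split_multi_delims(text, delimiters):
    """Split on any single delimiter character, keeping empty tokens."""
    if not delimiters:
        return [text]
    return re.split("[" + re.escape(delimiters) + "]", text)


def value_phrase(s1, s2):
    return levenshtein(s1, s2)


def value_words(s1, s2, delimiters=" _-"):
    words1 = split_multi_delims(s1, delimiters)
    words2 = split_multi_delims(s2, delimiters)
    total = 0
    for w1 in words1:
        best = len(s2)  # same starting bound as the VBA version
        for w2 in words2:
            distance = levenshtein(w1, w2)
            if distance < best:
                best = distance
            if distance == 0:
                break  # exact word match, no need to look further
        total += best
    return total


print(value_phrase("wild horses", "horses wild"))  # non-zero: order differs
print(value_words("wild horses", "horses wild"))   # 0: same words
```

The comparison is case-sensitive, as in the VBA; lowercasing both inputs first is a reasonable tweak if case should not count.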
    

    Measures of Similarity

    Using these two metrics, and a third which simply computes the distance between two strings (the lengthValue term in the formula further down), I have a set of variables over which I can run an optimization algorithm to achieve the greatest number of matches. Fuzzy string matching is, itself, a fuzzy science, and so by creating linearly independent metrics for measuring string similarity, and having a known set of strings we wish to match to each other, we can find the parameters that, for our specific styles of strings, give the best fuzzy match results.

    Initially, the goal of the metric was to have a low search value for an exact match, and increasing search values for increasingly permuted measures. In an impractical case, this was fairly easy to define using a set of well-defined permutations, and engineering the final formula such that they had increasing search values as desired.

    (screenshot omitted)

    As you can see, the last two metrics, which are fuzzy string matching metrics, already have a natural tendency to give low scores to strings that are meant to match (down the diagonal). This is very good.

    Application

    To allow the optimization of fuzzy matching, I weight each metric. As such, every application of fuzzy string matching can weight the parameters differently. The formula that defines the final score is a simple combination of the metrics and their weights:

    value = Min(phraseWeight*phraseValue, wordsWeight*wordsValue)*minWeight + 
            Max(phraseWeight*phraseValue, wordsWeight*wordsValue)*maxWeight + lengthWeight*lengthValue
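
The formula translates almost directly into code. The Python sketch below takes the three metric values as plain numbers, so it makes no assumption about how `lengthValue` is computed. The default weights of 1.0 are placeholders, not tuned values from the answer.

```python
def combined_value(phrase_value, words_value, length_value,
                   phrase_weight=1.0, words_weight=1.0,
                   min_weight=1.0, max_weight=1.0, length_weight=1.0):
    """Weighted combination of the three metrics; lower means a closer match."""
    weighted_phrase = phrase_weight * phrase_value
    weighted_words = words_weight * words_value
    return (min(weighted_phrase, weighted_words) * min_weight
            + max(weighted_phrase, weighted_words) * max_weight
            + length_weight * length_value)


# With min_weight well above max_weight, the better of the two
# Levenshtein-based metrics dominates the score.
print(combined_value(6, 2, 3, phrase_weight=0.5, words_weight=1,
                     min_weight=10, max_weight=1, length_weight=2))
```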
    

    Using an optimization algorithm (a neural network is best here because it is a discrete, multi-dimensional problem), the goal is now to maximize the number of matches. I created a function that detects the number of correct matches of each set to each other, as can be seen in this final screenshot. A column or row gets a point if the lowest score is assigned to the string that was meant to be matched, and partial points are given if there is a tie for the lowest score and the correct match is among the tied strings. I then optimized it. You can see that a green cell is the column that best matches the current row, and a blue square around the cell is the row that best matches the current column. The score in the bottom corner is roughly the number of successful matches, and this is what we tell our optimization problem to maximize.
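
The point-counting rule can be sketched as follows, in Python for brevity. This is an illustration under assumptions, not the answer's actual function: it scores rows only (transpose the matrix to score columns), and it assumes a tie splits the point evenly, which the answer does not specify.

```python
def match_points(scores, expected):
    """scores[i][j] is the score of row string i against column string j;
    expected[i] is the column index that row i is meant to match."""
    points = 0.0
    for row, target in zip(scores, expected):
        lowest = min(row)
        tied = [j for j, value in enumerate(row) if value == lowest]
        if target in tied:
            points += 1.0 / len(tied)  # assumption: ties share the point
    return points


scores = [[0, 5, 7],
          [4, 1, 1],   # columns 1 and 2 tie for the lowest score
          [9, 8, 2]]
print(match_points(scores, expected=[0, 1, 2]))  # 2.5
```

An optimizer would recompute the score matrix for each candidate set of weights and keep the weights that give the highest total.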

    (screenshot omitted)

  • 2020-12-13 10:40

    A lot of work has been done on the somewhat related problem of DNA sequence alignment (search for "local sequence alignment"). The classic algorithms are Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment), and more complex modern ones are also easy to find. The idea, similar to Greg's answer, is that instead of identifying and comparing keywords, you try to find the longest loosely matching substrings within long strings.
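
Local alignment can be sketched with the Smith-Waterman recurrence, shown here in Python for brevity. The match, mismatch and gap values are arbitrary illustrative choices, and only the best score is returned, not the aligned substrings.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any substring of a and any of b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0,                    # start a fresh alignment here
                        diagonal,             # align the two characters
                        previous[j] + gap,    # skip a character of a
                        current[j - 1] + gap) # skip a character of b
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# The noisy prefixes and suffixes cost nothing; only the shared core scores.
print(smith_waterman_score("xxwild_horsesyy", "zzwild_horseszz"))  # 22
```

Because scores never drop below zero, unrelated junk around the shared part is ignored, which is what makes the local variant a better fit for noisy file names than global alignment.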

    That being said, if the only goal is sorting music, a number of regular expressions to cover possible naming schemes would probably work better than any generic algorithm.
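
A small sketch of the regular-expression approach, in Python for brevity. The two naming schemes below are hypothetical examples (dot-separated fields, underscores for spaces); real collections would need their own patterns.

```python
import re

# Hypothetical naming schemes, tried in order from most to least specific.
PATTERNS = [
    re.compile(r"^(?P<artist>[^.]+)\.(?P<album>[^.]+)\.(?P<title>[^.]+)\.mp3$",
               re.IGNORECASE),
    re.compile(r"^(?P<artist>[^.]+)\.(?P<title>[^.]+)\.mp3$", re.IGNORECASE),
]


def parse_track(filename):
    """Return the named fields of the first matching scheme, or None."""
    for pattern in PATTERNS:
        found = pattern.match(filename)
        if found:
            return {field: value.replace("_", " ")
                    for field, value in found.groupdict().items()}
    return None


print(parse_track("Rolling_Stones.Best_of_2003.Wild_Horses.mp3"))
print(parse_track("Tori_Amos.Wild_Horses.mp3"))
```

Once the title is extracted as its own field, a plain or fuzzy comparison on titles alone avoids most of the noise problem.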

  • 2020-12-13 10:47

    There is a GitHub repo implementing several methods.
