Fuzzy string matching in Python

暗喜 2020-12-23 23:32

I have two lists of over a million names with slightly different naming conventions. The goal here is to match those records that are similar, at a 95% confidence threshold.

3 answers
  •  爱一瞬间的悲伤
    2020-12-23 23:43

    There are several levels of optimization possible here to bring this problem down from O(n^2) to a lower time complexity.

    • Preprocessing: Sort your list in the first pass, creating an output map for each string; the key for the map can be the normalized string. Normalizations may include:

      • lowercase conversion,
      • removal of whitespace and special characters,
      • transforming Unicode to ASCII equivalents where possible (use unicodedata.normalize or the unidecode module).

      This would result in "Andrew H Smith", "andrew h. smith", "ándréw h. smith" generating the same key "andrewhsmith", and would reduce your set of a million names to a smaller set of unique/similar grouped names.

    You can use this utility method to normalize your string (it does not include the Unicode part, though):

    import re

    def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=None):
        """ Processes a string for similarity comparisons: removes the substrings which are in ignore_list
            and, if normalized is True, cleans special characters and whitespace
        Args:
            input_str (str) : input string to be processed
            normalized (bool) : if True, method removes special characters and whitespace from string,
                                and converts to lowercase
            ignore_list (list) : the substrings which need to be removed from the input string
        Returns:
           str : returns processed string
        """
        # None replaces the mutable [] default, which would be shared across calls
        for ignore_str in (ignore_list or []):
            # re.escape makes each entry match literally as a substring, not as a regex pattern
            input_str = re.sub(re.escape(ignore_str), "", input_str, flags=re.IGNORECASE)
    
        if normalized is True:
            input_str = input_str.strip().lower()
            # remove every non-word character (whitespace, punctuation); note that \W keeps underscores
            input_str = re.sub(r"\W", "", input_str).strip()
    
        return input_str
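
    The Unicode step that the utility above leaves out, together with the grouping of names under their normalized key, could be sketched roughly as follows. The helper names normalized_key and group_by_key are hypothetical (they are not from the answer), and the accent folding assumes that decomposing with unicodedata.normalize("NFKD", ...) and dropping combining marks is good enough for your data; characters without a decomposition are left unchanged.

```python
import re
import unicodedata
from collections import defaultdict


def normalized_key(name):
    """Hypothetical helper: fold accents, lowercase, drop non-alphanumerics."""
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    # drop the combining marks, keeping the base letters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # lowercase, then remove everything that is not a letter or digit
    return re.sub(r"[\W_]", "", stripped.lower())


def group_by_key(names):
    """Map each normalized key to the list of original names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalized_key(name)].append(name)
    return dict(groups)


names = ["Andrew H Smith", "andrew h. smith", "ándréw h. smith", "Andrew H. Smeeth"]
groups = group_by_key(names)
# The first three names share the key "andrewhsmith";
# "Andrew H. Smeeth" lands under "andrewhsmeeth".
```

    Only the distinct keys then need to go through the more expensive fuzzy comparison.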
    
    • Now similar strings will already lie in the same bucket if their normalized key is the same.

    • For further comparison, you only need to compare the keys, not the names, e.g. andrewhsmith and andrewhsmeeth, since this kind of similarity needs fuzzy string matching on top of the normalized comparison done above.

    • Bucketing: Do you really need to compare a 5-character key with a 9-character key to see if it is a 95% match? No, you do not. So you can bucket your strings by length, e.g. 5-character names will be matched with 4-6 character names, 6-character names with 5-7 character names, etc. An n-1 to n+1 character limit for an n-character key is a reasonably good bucket for most practical matching.

    • Beginning match: Most variations of a name will have the same first character in the normalized format (e.g. Andrew H Smith, ándréw h. smith, and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith, and andrewhsmeeth respectively). They will usually not differ in the first character, so you can match keys starting with a only against other keys that start with a and fall within the length buckets. This would greatly reduce your matching time. There is no need to match the key andrewhsmith against bndrewhsmith, as a name variation in the first letter will rarely exist.
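
    The bucketing and beginning-match ideas can be combined into one index, roughly as sketched below. The function names build_blocks and candidates are hypothetical, and the sketch assumes the keys are already normalized; it indexes each key by (first character, length) and looks up only the neighbouring lengths n-1, n, and n+1.

```python
from collections import defaultdict


def build_blocks(keys):
    """Index normalized keys by (first character, length)."""
    blocks = defaultdict(list)
    for key in keys:
        if key:  # skip empty keys, which have no first character
            blocks[(key[0], len(key))].append(key)
    return blocks


def candidates(key, blocks):
    """Yield keys with the same first character whose length is within 1 of len(key)."""
    if not key:
        return
    for length in (len(key) - 1, len(key), len(key) + 1):
        # .get avoids creating empty buckets in the defaultdict
        yield from blocks.get((key[0], length), [])


blocks = build_blocks(["andrewhsmeeth", "andrewsmith", "bndrewhsmith", "andy"])
found = sorted(candidates("andrewhsmith", blocks))
# found == ["andrewhsmeeth", "andrewsmith"]
```

    Here bndrewhsmith is skipped because of its first letter and andy because of its length, so only two of the four keys reach the fuzzy comparison.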

    Then you can use something along the lines of this method (or the FuzzyWuzzy module) to find the string similarity percentage; you may exclude one of jaro_winkler or difflib to optimize speed and result quality:

    import difflib

    import jellyfish

    def find_string_similarity(first_str, second_str, normalized=False, ignore_list=None):
        """ Calculates matching ratio between two strings
        Args:
            first_str (str) : First String
            second_str (str) : Second String
            normalized (bool) : if True, method removes special characters and extra whitespace
                                from strings then calculates matching ratio
            ignore_list (list) : list of substrings which have to be substituted with "" in the strings
        Returns:
           Float Value : Returns a matching ratio between 1.0 ( most matching ) and 0.0 ( not matching )
                        using difflib's SequenceMatcher and jellyfish's jaro_winkler algorithms with
                        equal weightage to each
        Examples:
            >>> find_string_similarity("hello world","Hello,World!",normalized=True)
            1.0
            >>> find_string_similarity("entrepreneurship","entreprenaurship")
            0.95625
            >>> find_string_similarity("Taj-Mahal","The Taj Mahal",normalized= True,ignore_list=["the","of"])
            1.0
        """
        first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
        second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
        # Python 3 strings are already Unicode, so the Python 2 unicode() calls are dropped.
        # Recent jellyfish releases name this function jaro_winkler_similarity (older ones: jaro_winkler).
        difflib_ratio = difflib.SequenceMatcher(None, first_str, second_str).ratio()
        jaro_winkler_ratio = jellyfish.jaro_winkler_similarity(first_str, second_str)
        match_ratio = (difflib_ratio + jaro_winkler_ratio) / 2.0
        return match_ratio
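
    Applying the 95% cut-off from the question to a list of candidate keys might then look like the sketch below. The names difflib_ratio and best_match are hypothetical; the default scorer uses only difflib so that the example runs without third-party packages, and any function returning a ratio between 0.0 and 1.0 (such as the combined one above) could be passed in its place.

```python
import difflib


def difflib_ratio(a, b):
    """Stdlib-only stand-in scorer; returns a ratio between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def best_match(key, candidate_keys, scorer=difflib_ratio, threshold=0.95):
    """Return (candidate, score) for the best candidate at or above threshold, else None."""
    best = None
    for cand in candidate_keys:
        score = scorer(key, cand)
        # keep only candidates that clear the threshold, preferring the highest score
        if score >= threshold and (best is None or score > best[1]):
            best = (cand, score)
    return best


result = best_match(
    "alexanderhamiltonsmith",
    ["alexanderhamiltonsmyth", "alexanderhamilton"],
)
miss = best_match("andrewhsmith", ["andrewhsmeeth"])
```

    With the difflib-only scorer, andrewhsmith against andrewhsmeeth scores 0.88 and is rejected at 0.95, while a single substituted letter in a 22-character key still passes. How strict 95% is in practice therefore depends on key length and on the scorer, so the threshold is worth checking against a sample of known matches.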
    
