How to implement Unicode string matching by folding in Python

灰色年华 2020-12-13 11:31

I have an application implementing incremental search. I have a catalog of Unicode strings to be matched and match them to a given "key" string; a catalog string is a "hi

5 answers
  •  天命终不由人
    2020-12-13 12:04

    A general purpose solution (especially for search normalization and generating slugs) is the unidecode module:

    http://pypi.python.org/pypi/Unidecode

    It's a port of the Text::Unidecode module for Perl. It's not complete, but it translates every Latin-derived character I could find, transliterates Cyrillic, Chinese, etc. to Latin, and even handles full-width characters correctly.

    It's probably a good idea to strip every character you don't want in the final output, or to replace it with a filler (e.g. "äßœ$" is unidecoded to "assoe$", so you might want to strip the non-alphanumerics). For characters that it transliterates but shouldn't (say, §=>SS and €=>EU), you need to clean up the input first:

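    from unidecode import unidecode  # third-party package: pip install Unidecode

    input_str = 'äßœ$'
    # Replace every non-alphanumeric character with a filler before transliterating.
    input_str = ''.join(ch if ch.isalnum() else '-' for ch in input_str)
    input_str = unidecode(input_str).lower()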
    

    This replaces every non-alphanumeric character with a filler, then transliterates the string and converts it to lowercase.
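
For the incremental-search case in the question, the same folding idea can be sketched with the standard library alone. This is a different and narrower technique than unidecode: it only strips diacritics (via NFKD decomposition) and case-folds, so it will not transliterate œ, Cyrillic, or Chinese. The names `fold` and `incremental_matches` and the sample catalog are hypothetical, chosen for illustration, and substring matching is an assumption about what "match" means here.

```python
import unicodedata


def fold(text: str) -> str:
    """Fold text for loose matching: drop diacritics and ignore case."""
    # NFKD splits accented letters into base letter + combining marks
    # and maps compatibility forms (e.g. full-width letters) to plain ones.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(): it also maps 'ß' to 'ss'.
    return stripped.casefold()


def incremental_matches(key: str, catalog: list[str]) -> list[str]:
    """Return catalog entries whose folded form contains the folded key."""
    folded_key = fold(key)
    return [entry for entry in catalog if folded_key in fold(entry)]


# Hypothetical catalog, for illustration only.
catalog = ["Ångström", "Straße", "naïve café", "Ｆｕｌｌｗｉｄｔｈ"]
print(incremental_matches("angst", catalog))
print(incremental_matches("STRASS", catalog))
print(incremental_matches("full", catalog))
```

Folding both the key and the catalog entry with the same function is what makes the comparison symmetric. For a large catalog it would likely be worth folding each entry once up front rather than on every keystroke.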
