python - problems with regular expression and unicode

遇见更好的自我 2020-12-18 09:06

Hi, I have a problem in Python. I'll try to explain it with an example.

I have this string:

>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëì


        
1 Answer
  • 2020-12-18 09:26

    In Python 2, you need to make sure that your strings are unicode strings, not plain str strings (a plain str is a sequence of bytes, much like a byte array).

    Example:

    >>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
    >>> type(string)
    <type 'str'>
    
    # do this instead:
    # (note the u in front of the ', this marks the character sequence as a unicode literal)
    >>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
    # or:
    >>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
    # ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
    # ... it is a best practice to use the \xNN form in unicode literals, as in the first example
    
    >>> type(string)
    <type 'unicode'>
    >>> print string
    ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
    
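    >>> import re
    # the pattern is a unicode literal as well, so it lists code points (Ñ, Ã, ï), not their UTF-8 bytes
    >>> rePat = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)
    >>> print rePat.sub("", string)
    ÑïÃ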
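
For readers on Python 3, where every str is already a Unicode string, the same idea can be sketched as follows. This is an illustration, not part of the original session: the names text, keep_pattern and byte_pattern are made up here, and the byte-level half only shows what can go wrong when a character class is written over UTF-8 bytes instead of characters.

```python
import re

# Same characters as the example string: U+00D0..U+00FF followed by U+00C0..U+00C3.
text = "".join(chr(cp) for cp in list(range(0xD0, 0x100)) + [0xC0, 0xC1, 0xC2, 0xC3])

# Character-level pattern: remove everything except U+00D1, U+00C3 and U+00EF.
keep_pattern = re.compile("[^\u00d1\u00c3\u00ef]")
kept_text = keep_pattern.sub("", text)

# Byte-level pattern built from the UTF-8 bytes of the same three characters.
# The class now means "any of the bytes C3, 91, 83, AF", so every C3 lead byte
# survives and the result is no longer the UTF-8 encoding of the kept text.
utf8_bytes = text.encode("utf-8")
byte_pattern = re.compile(b"[^\xc3\x91\xc3\x83\xc3\xaf]")
kept_bytes = byte_pattern.sub(b"", utf8_bytes)

print(ascii(kept_text))                          # '\xd1\xef\xc3'
print(kept_bytes == kept_text.encode("utf-8"))   # False
```

The character-level pattern keeps exactly the three wanted characters, while the byte-level one leaves behind a byte sequence that is not valid UTF-8.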
    

    When reading from a file, string = open('filename.txt').read() reads a byte sequence.

    To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
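
A minimal Python 3 sketch of the same step, reading the raw bytes and decoding them explicitly. The helper name read_text and the temporary sample file are assumptions made for illustration. In Python 3, it is opening the file in binary mode ('rb') that yields bytes.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a file as raw bytes, then decode them into a Unicode string."""
    with open(path, "rb") as handle:
        raw = handle.read()        # bytes, exactly as stored on disk
    return raw.decode(encoding)    # str (Unicode) from here on


# Demo: write three Latin-1 encoded characters, then read them back.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"\xd1\xc3\xef")   # U+00D1, U+00C3, U+00EF in Latin-1
    content = read_text(sample_path, "latin-1")
finally:
    os.remove(sample_path)

print(len(content))  # 3
```

The encoding has to be the one the file was actually written with. Decoding these bytes as UTF-8 would raise a UnicodeDecodeError.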

    The codecs module can decode encoded byte streams (such as files) into unicode on the fly.
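
One way to decode a stream on the fly with codecs is to wrap a binary stream in a stream reader. The sketch below is written for Python 3 and assumes UTF-8 input. The helper name decoded_lines is hypothetical, and an in-memory io.BytesIO stands in for a file opened in binary mode.

```python
import codecs
import io


def decoded_lines(byte_stream, encoding="utf-8"):
    """Yield Unicode lines from a binary stream, decoding as it is read."""
    reader = codecs.getreader(encoding)(byte_stream)
    for line in reader:
        yield line.rstrip("\n")


# Demo: two UTF-8 encoded lines in an in-memory binary stream.
stream = io.BytesIO("Ñandú\nmaïs\n".encode("utf-8"))
lines = list(decoded_lines(stream))

print(len(lines))  # 2
```

Each yielded line is already a Unicode string, so the rest of the program never has to handle the raw bytes.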

    Do a Google search for "python unicode". Unicode handling in Python can be hard to grasp at first, so it pays to read up on it.

    I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
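
As a rough Python 3 sketch of that rule: decode at the input boundary, do all the work on Unicode strings, and encode only when writing the result out. The function name convert_upper and the choice of encodings are assumptions made for illustration.

```python
def convert_upper(raw, in_encoding, out_encoding):
    """Decode on input, work on Unicode text, encode on output."""
    text = raw.decode(in_encoding)        # input boundary: bytes -> str
    result = text.upper()                 # internal work happens on str only
    return result.encode(out_encoding)    # output boundary: str -> bytes


latin1_input = b"ma\xefs"                 # "maïs" encoded as Latin-1
utf8_output = convert_upper(latin1_input, "latin-1", "utf-8")

print(utf8_output)  # b'MA\xc3\x8fS'
```

Because the uppercasing runs on decoded text, it handles ï correctly regardless of which encodings are used at either end.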

    I also recommend: http://www.joelonsoftware.com/articles/Unicode.html
