Python: How can I replace full-width characters with half-width characters?

清酒与你 2020-12-16 00:51

If this was PHP, I would probably do something like this:

function no_more_half_widths($string){
  $foo = array('１','２','３','４','５','６','７','８


        
6 Answers
  • 2020-12-16 01:04

    In Python 3, the cleanest way is to use str.translate and str.maketrans:

    FULLWIDTH_TO_HALFWIDTH = str.maketrans('１２３４５６７８９０',
                                           '1234567890')
    def fullwidth_to_halfwidth(s):
        return s.translate(FULLWIDTH_TO_HALFWIDTH)
    

    In Python 2, str.maketrans is instead string.maketrans and doesn’t work with Unicode characters, so you need to build a dictionary, as Josh Lee notes in another answer.
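
A quick check of what that table looks like and how it behaves, as a sketch assuming Python 3. The names `FULLWIDTH_DIGITS`, `sample` and `converted` are made up for illustration, and the fullwidth characters are written as escapes so they survive tools that silently normalize them:

```python
# Fullwidth digits one..nine, then zero (U+FF11..U+FF19, U+FF10).
FULLWIDTH_DIGITS = '\uff11\uff12\uff13\uff14\uff15\uff16\uff17\uff18\uff19\uff10'
FULLWIDTH_TO_HALFWIDTH = str.maketrans(FULLWIDTH_DIGITS, '1234567890')

# maketrans returns a plain dict keyed by code point.
first_key_maps_to_one = FULLWIDTH_TO_HALFWIDTH[0xFF11] == ord('1')

# Characters without an entry (here U+FF0D, fullwidth hyphen-minus)
# pass through unchanged.
sample = 'Tel: \uff10\uff13\uff0d\uff11\uff12\uff13\uff14'
converted = sample.translate(FULLWIDTH_TO_HALFWIDTH)

print(first_key_maps_to_one)  # True
print(repr(converted))
```

So a digits-only table leaves fullwidth punctuation and letters alone, which may or may not be what you want.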

  • 2020-12-16 01:07

    In Python 3, you can use the following snippet. It builds a map from each fullwidth character to its corresponding ASCII character. Best of all, you don't have to type out the ASCII sequence by hand, which is quite error-prone.

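     #!/usr/bin/env python3
     # -*- coding: utf-8 -*-

     # Map each fullwidth form U+FF01..U+FF5E to its ASCII counterpart.
     FULL2HALF = {i + 0xFEE0: i for i in range(0x21, 0x7F)}
     # The ideographic space U+3000 stands in for a fullwidth U+0020.
     FULL2HALF[0x3000] = 0x20

     def halfen(s):
         '''
         Convert fullwidth characters to their ASCII counterparts.
         '''
         return str(s).translate(FULL2HALF)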
    

    With the same logic, you can also convert halfwidth characters to fullwidth ones; the following code shows the trick:

     #!/usr/bin/env python3
     # -*- coding: utf-8 -*-

     # Map each ASCII character U+0021..U+007E to its fullwidth form.
     HALF2FULL = {i: i + 0xFEE0 for i in range(0x21, 0x7F)}
     # The ASCII space maps to the ideographic space U+3000.
     HALF2FULL[0x20] = 0x3000

     def fullen(s):
         '''
         Convert all ASCII characters to their fullwidth counterparts.
         '''
         return str(s).translate(HALF2FULL)
    

    Note: these two snippets only consider ASCII characters and do not convert any Japanese/Korean fullwidth characters.
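
To make that note concrete, here is a small round-trip check, as a sketch assuming Python 3. `halfen` and `fullen` follow the snippets in this answer; building `HALF2FULL` by inverting `FULL2HALF` is just a shortcut for this example, and `ascii_text` and `katakana` are illustrative names:

```python
FULL2HALF = {i + 0xFEE0: i for i in range(0x21, 0x7F)}
FULL2HALF[0x3000] = 0x20
# Invert the table instead of spelling it out a second time.
HALF2FULL = {half: full for full, half in FULL2HALF.items()}

def halfen(s):
    return str(s).translate(FULL2HALF)

def fullen(s):
    return str(s).translate(HALF2FULL)

ascii_text = 'Hello, World 42!'
round_trip = halfen(fullen(ascii_text))

# Halfwidth katakana (U+FF76 U+FF85) lies outside U+FF01..U+FF5E,
# so neither table touches it.
katakana = '\uff76\uff85'

print(round_trip == ascii_text)      # True
print(halfen(katakana) == katakana)  # True
```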

    For completeness, from Wikipedia:

    Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as fullwidth forms, that is, a fixed width form used in CJK computing. This is useful for typesetting Latin characters in a CJK environment. U+FF00 does not correspond to a fullwidth ASCII 20 (space character), since that role is already fulfilled by U+3000 "ideographic space."

    Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul characters.

    Range U+FFE0–FFEE includes fullwidth and halfwidth symbols.
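
If you want to check which of these ranges a given character falls into, the standard library's `unicodedata.east_asian_width` reports the East Asian Width class (`F` for fullwidth, `H` for halfwidth, `Na` for narrow, and so on). A minimal sketch, with sample characters chosen only for illustration:

```python
import unicodedata

samples = {
    'fullwidth A (U+FF21)': '\uff21',
    'ideographic space (U+3000)': '\u3000',
    'halfwidth katakana KA (U+FF76)': '\uff76',
    'fullwidth yen sign (U+FFE5)': '\uffe5',
    'ASCII A (U+0041)': 'A',
}

# Look up the East Asian Width class of each sample character.
widths = {name: unicodedata.east_asian_width(ch) for name, ch in samples.items()}

for name, width in widths.items():
    print(name, width)
```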

    For a Python 2 solution, see gist/jcayzac.

  • 2020-12-16 01:19

    Using the unicode.translate method (Python 2):

    >>> table = dict(zip(map(ord, u'０１２３４５６７８９'), map(ord, u'0123456789')))
    >>> print u'１２３'.translate(table)
    123
    

    It requires a mapping of code points as numbers, not characters. Also, using u'unicode literals' leaves the values unencoded.
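
For comparison, a rough Python 3 equivalent, where `str.translate` takes the same kind of table. This is only a sketch: `fullwidth_digits`, `table_with_delete` and the other names are made up here. Keys have to be code points, while values may be code points, strings, or `None` (which deletes the character):

```python
# Fullwidth digits zero..nine are U+FF10..U+FF19.
fullwidth_digits = ''.join(chr(0xFF10 + n) for n in range(10))

# Keys and values as code points, like the Python 2 table.
table = dict(zip(map(ord, fullwidth_digits), map(ord, '0123456789')))
result = '\uff11\uff12\uff13'.translate(table)

# Mapping a code point to None removes it from the output.
table_with_delete = dict(table)
table_with_delete[0x3000] = None  # drop ideographic spaces
deleted = '\uff11\u3000\uff12'.translate(table_with_delete)

print(result)   # 123
print(deleted)  # 12
```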

  • 2020-12-16 01:27

    Regex approach (Python 2):

    >>> import re
    >>> re.sub(u"[\uff10-\uff19]", lambda x: chr(ord(x.group(0)) - 0xfee0), u"４５６")
    u'456'
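
The same idea can be stretched to the whole fullwidth ASCII block in Python 3. A sketch only: `FULLWIDTH_ASCII` and `halfwidth_via_regex` are hypothetical names, and the ideographic space is handled separately because it does not sit at the 0xFEE0 offset:

```python
import re

# Fullwidth forms of ASCII 0x21..0x7E occupy U+FF01..U+FF5E.
FULLWIDTH_ASCII = re.compile('[\uff01-\uff5e]')

def halfwidth_via_regex(text):
    # Shift every match down by 0xFEE0 to reach its ASCII counterpart.
    shifted = FULLWIDTH_ASCII.sub(lambda m: chr(ord(m.group(0)) - 0xFEE0), text)
    # U+3000 (ideographic space) needs its own replacement.
    return shifted.replace('\u3000', ' ')

print(halfwidth_via_regex('\uff14\uff15\uff16'))  # 456
```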
    
  • 2020-12-16 01:28

    I don't think there's a built-in function to do multiple replacements in one pass, so you'll have to do it yourself.

    One way to do it:

    >>> src = (u'１',u'２',u'３',u'４',u'５',u'６',u'７',u'８',u'９',u'０')
    >>> dst = ('1','2','3','4','5','6','7','8','9','0')
    >>> string = u'a１２３'
    >>> for i, j in zip(src, dst):
    ...     string = string.replace(i, j)
    ... 
    >>> string
    u'a123'
    

    Or using a dictionary:

    >>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
    >>> string = u'a１２３'
    >>> for i, j in trans.iteritems():
    ...     string = string.replace(i, j)
    ...     
    >>> string
    u'a123'
    

    Or finally, using regex (and this might actually be the fastest):

    >>> import re
    >>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
    >>> lookup = re.compile(u'|'.join(trans.keys()), re.UNICODE)
    >>> string = u'a１２３'
    >>> lookup.sub(lambda x: trans[x.group()], string)
    u'a123'
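
The sessions above are Python 2 (`u''` literals, `iteritems`). A possible Python 3 port of the dictionary-plus-regex variant might look like this. It is a sketch, not a benchmark: `replace_digits` is a hypothetical helper, and the keys are generated from code points rather than typed in:

```python
import re

# Fullwidth digit -> ASCII digit, generated from U+FF10..U+FF19.
trans = {chr(0xFF10 + n): str(n) for n in range(10)}

# One alternation over all keys; re.escape guards against keys that
# happen to be regex metacharacters if the table is extended later.
lookup = re.compile('|'.join(map(re.escape, trans)))

def replace_digits(text):
    # Each match is looked up in the table, so the string is scanned once.
    return lookup.sub(lambda m: trans[m.group()], text)

result = replace_digits('a\uff11\uff12\uff13')
print(result)  # a123
```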
    
  • 2020-12-16 01:29

    The built-in unicodedata module can do it:

    >>> import unicodedata
    >>> foo = u'１２３４５６７８９０'
    >>> unicodedata.normalize('NFKC', foo)
    u'1234567890'
    

    “NFKC” stands for “Normalization Form KC [Compatibility Decomposition, followed by Canonical Composition]”. It replaces full-width characters with their half-width counterparts, which Unicode treats as compatibility equivalents.

    Note that it also normalizes all sorts of other things at the same time, like separate accent marks and Roman numeral symbols.
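
A few of those side effects, as a sketch assuming Python 3 (the `examples` dictionary and the `nfkc` helper are illustrative). Note the last entry: halfwidth katakana comes out as regular, wide katakana, so NFKC does not only go from wide to narrow:

```python
import unicodedata

def nfkc(text):
    return unicodedata.normalize('NFKC', text)

examples = {
    'fullwidth digits': '\uff11\uff12\uff13',  # becomes '123'
    'Roman numeral four': '\u2163',            # becomes 'IV'
    'fi ligature': '\ufb01',                   # becomes 'fi'
    'e + combining acute': 'e\u0301',          # becomes U+00E9
    'halfwidth katakana KA': '\uff76',         # becomes U+30AB
}

normalized = {name: nfkc(text) for name, text in examples.items()}

for name, text in normalized.items():
    # ascii() keeps the output readable on any terminal.
    print(name, ascii(text))
```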
