If this were PHP, I would probably do something like this:
function no_more_half_widths($string){
    $foo = array('１','２','３','４','５','６','７','８','９','０');
    $bar = array('1','2','3','4','5','6','7','8','9','0');
    return str_replace($foo, $bar, $string);
}
In Python 3, cleanest is to use str.translate and str.maketrans:
FULLWIDTH_TO_HALFWIDTH = str.maketrans(u'１２３４５６７８９０',
                                       u'1234567890')

def fullwidth_to_halfwidth(s):
    return s.translate(FULLWIDTH_TO_HALFWIDTH)
In Python 2, str.maketrans is instead string.maketrans, and it doesn't work with Unicode characters, so you need to build a dictionary instead, as Josh Lee notes above.
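For example, a minimal sketch of the dictionary approach (the names here are mine, not from the answer above; the same table works with unicode.translate on Python 2 and str.translate on Python 3, since both accept a dict keyed by code points):

```python
# -*- coding: utf-8 -*-
# Map each fullwidth digit's code point to its halfwidth counterpart.
FULLWIDTH_TO_HALFWIDTH = {ord(f): ord(h)
                          for f, h in zip(u'１２３４５６７８９０',
                                          u'1234567890')}

def fullwidth_to_halfwidth(s):
    return s.translate(FULLWIDTH_TO_HALFWIDTH)

print(fullwidth_to_halfwidth(u'１２３'))  # -> 123
```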
In Python 3, you can use the following snippet. It builds a mapping between every printable ASCII character and its corresponding fullwidth character. Best of all, it doesn't require you to hand-type the ASCII sequence, which is error-prone.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

FULL2HALF = dict((i + 0xFEE0, i) for i in range(0x21, 0x7F))
FULL2HALF[0x3000] = 0x20

def halfen(s):
    '''
    Convert full-width characters to ASCII counterpart
    '''
    return str(s).translate(FULL2HALF)
Also, with the same logic, you can convert halfwidth characters to fullwidth ones; the following code shows the trick:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

HALF2FULL = dict((i, i + 0xFEE0) for i in range(0x21, 0x7F))
HALF2FULL[0x20] = 0x3000

def fullen(s):
    '''
    Convert all ASCII characters to the full-width counterpart.
    '''
    return str(s).translate(HALF2FULL)
Note: these two snippets only consider ASCII characters, and do not convert any Japanese/Korean halfwidth Katakana or Hangul characters.
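Putting the two tables together, here is a quick round trip (a self-contained sketch restating the snippets above):

```python
# Fullwidth <-> halfwidth tables for the printable ASCII range.
FULL2HALF = dict((i + 0xFEE0, i) for i in range(0x21, 0x7F))
FULL2HALF[0x3000] = 0x20  # ideographic space -> ASCII space
HALF2FULL = dict((i, i + 0xFEE0) for i in range(0x21, 0x7F))
HALF2FULL[0x20] = 0x3000  # ASCII space -> ideographic space

text = 'Hello, 123!'
wide = text.translate(HALF2FULL)
print(wide)                       # Ｈｅｌｌｏ，　１２３！
print(wide.translate(FULL2HALF))  # Hello, 123!
```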
For completeness, from Wikipedia:
Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as fullwidth forms, that is, a fixed-width form used in CJK computing. This is useful for typesetting Latin characters in a CJK environment.
U+FF00 does not correspond to a fullwidth ASCII 20 (space character), since that role is already fulfilled by U+3000 "ideographic space."
Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul characters.
Range U+FFE0–FFEE includes fullwidth and halfwidth symbols.
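The fixed offset of 0xFEE0 between ASCII 0x21–0x7E and the U+FF01–FF5E block is exactly what the dict comprehensions above exploit; a quick check:

```python
# Each printable ASCII character sits 0xFEE0 below its fullwidth form.
for half in '!Az~':
    full = chr(ord(half) + 0xFEE0)
    print(half, '->', full, hex(ord(full)))
# '!' maps to U+FF01 and '~' (0x7E) to U+FF5E, the ends of the block.
```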
Also, for a Python 2 solution, see gist/jcayzac.
Using the unicode.translate method:
>>> table = dict(zip(map(ord, u'０１２３４５６７８９'), map(ord, u'0123456789')))
>>> print u'１２３'.translate(table)
123
It requires a mapping of code points as numbers, not characters. Also, using u'unicode literals' leaves the values unencoded.
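To illustrate the code-points-not-characters point (a Python 3 sketch; translate simply skips characters whose code point is not a key in the table):

```python
# A dict keyed by characters is silently ignored by translate...
bad = {u'１': u'1'}
print(u'１'.translate(bad))    # １  (unchanged)

# ...while a dict keyed by code points (ints) works.
good = {ord(u'１'): u'1'}
print(u'１'.translate(good))   # 1
```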
Regex approach
>>> import re
>>> re.sub(u"[\uff10-\uff19]", lambda x: chr(ord(x.group(0)) - 0xfee0), u"４５６")
u'456'
I don't think there's a built-in function to do multiple replacements in one pass, so you'll have to do it yourself.
One way to do it:
>>> src = (u'１', u'２', u'３', u'４', u'５', u'６', u'７', u'８', u'９', u'０')
>>> dst = ('1', '2', '3', '4', '5', '6', '7', '8', '9', '0')
>>> string = u'a１２３'
>>> for i, j in zip(src, dst):
...     string = string.replace(i, j)
...
>>> string
u'a123'
Or using a dictionary:
>>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
>>> string = u'a１２３'
>>> for i, j in trans.iteritems():
...     string = string.replace(i, j)
...
>>> string
u'a123'
Or finally, using regex (and this might actually be the fastest):
>>> import re
>>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
>>> lookup = re.compile(u'|'.join(trans.keys()), re.UNICODE)
>>> string = u'a１２３'
>>> lookup.sub(lambda x: trans[x.group()], string)
u'a123'
The built-in unicodedata module can do it:
>>> import unicodedata
>>> foo = u'１２３４５６７８９０'
>>> unicodedata.normalize('NFKC', foo)
u'1234567890'
The “NFKC” stands for “Normalization Form KC [Compatibility Decomposition, followed by Canonical Composition]”; it replaces fullwidth characters with their halfwidth counterparts, which Unicode treats as equivalent.
Note that it also normalizes all sorts of other things at the same time, like separate accent marks and Roman numeral symbols.
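For instance (illustrative examples, not from the original answer):

```python
import unicodedata

# NFKC folds many compatibility characters, not just fullwidth forms:
print(unicodedata.normalize('NFKC', 'Ⅻ'))       # Roman numeral -> XII
print(unicodedata.normalize('NFKC', '²'))        # superscript -> 2
print(unicodedata.normalize('NFKC', 'ﬁ'))       # ligature -> fi
# It also composes separate accent marks into single characters:
print(unicodedata.normalize('NFKC', 'e\u0301'))  # e + combining acute -> é
```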