I have the following code:
import string
def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:
My problem was a little different from the others here. I knew that my string might contain Unicode characters (thanks to Email on Mac...). One of the common ones was the em dash, u"\u2014", which needed to be converted (back) to two dashes, "--". The other characters that might turn up are single-character translations, so they can be handled much like in the other solutions.
First I created a dict for the em dash. Because it maps one character to two, I convert it with the string's replace() method. Other one-to-many replacements could be added here too.
uTranslateDict = {
    u"\u2014": "--", # Emdash
}
Then I created a list of tuples for the 1:1 translations. These go through the unicode translate() method.
uTranslateTuple = [(u"\u2010", "-"), # Hyphen
                   (u"\u2013", "-"), # Endash
                   (u"\u2018", "'"), # Left single quote => single quote
                   (u"\u2019", "'"), # Right single quote => single quote
                   (u"\u201a", "'"), # Single Low-9 quote => single quote
                   (u"\u201b", "'"), # Single High-Reversed-9 quote => single quote
                   (u"\u201c", '"'), # Left double quote => double quote
                   (u"\u201d", '"'), # Right double quote => double quote
                   (u"\u201e", '"'), # Double Low-9 quote => double quote
                   (u"\u201f", '"'), # Double High-Reversed-9 quote => double quote
                   (u"\u2022", "*"), # Bullet
                  ]
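As a small illustration of why the pairs get turned into a table keyed by ord(): translate() looks each character up by its integer code point, not by the character itself. The sketch below assumes Python 3 (where every str is Unicode), and the names pairs, table and sample are made up for the example.

```python
# Hypothetical sample pairs, a subset of the list above.
pairs = [("\u2013", "-"), ("\u2019", "'"), ("\u201c", '"'), ("\u201d", '"')]

# translate() expects integer code points as keys, hence ord().
table = {ord(src): dst for src, dst in pairs}

sample = "\u201cIt\u2019s 1\u20132 pages\u201d"
result = sample.translate(table)
print(result)
```

Characters that are not in the table pass through unchanged, so plain ASCII text is left alone.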
Then the function.
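def uTranslate(uToTranslate):
    # Map each code point to its 1:1 replacement for unicode.translate().
    uTranslateTable = dict((ord(From), unicode(To)) for From, To in uTranslateTuple)
    # Decode once, then apply every one-to-many replacement in turn.
    uIntermediateStr = uToTranslate.decode("utf-8")
    for c in uTranslateDict:
        uIntermediateStr = uIntermediateStr.replace(c, uTranslateDict[c])
    return uIntermediateStr.translate(uTranslateTable)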
Since I know the input is always a UTF-8 encoded byte string, I didn't have to handle both str and unicode inputs.
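If the input type is not known in advance, one possible approach is to decode only when given bytes. The following is a rough Python 3 sketch, not the code I actually ran: the names MULTI_CHAR, SINGLE_CHAR and to_ascii_punctuation are hypothetical, the UTF-8 default is an assumption, and the character list is shortened.

```python
# One-to-many replacements, applied with str.replace().
MULTI_CHAR = {"\u2014": "--"}  # em dash -> two dashes

# One-to-one replacements, applied with str.translate().
SINGLE_CHAR = [
    ("\u2010", "-"), ("\u2013", "-"),   # hyphen, en dash
    ("\u2018", "'"), ("\u2019", "'"),   # single quotes
    ("\u201c", '"'), ("\u201d", '"'),   # double quotes
    ("\u2022", "*"),                    # bullet
]

def to_ascii_punctuation(text, encoding="utf-8"):
    """Replace common typographic punctuation with ASCII equivalents."""
    # Accept either bytes or str; decode only when needed.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    for src, dst in MULTI_CHAR.items():
        text = text.replace(src, dst)
    table = {ord(src): dst for src, dst in SINGLE_CHAR}
    return text.translate(table)

print(to_ascii_punctuation("a\u2014b \u2018c\u2019".encode("utf-8")))
```

In Python 3, the values in a translate() table may be strings of any length, so the em dash could probably live in the same table; keeping the two steps separate just mirrors the structure above.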