Are there any standalone-ish solutions for normalizing international Unicode text to safe IDs and filenames in Python?
E.g. turn My International Text: åäö into something like my-international-text-aao.
I'll throw my own (partial) solution here too:
import unicodedata

def deaccent(some_unicode_string):
    # Decompose characters (NFD), then drop all combining marks (category 'Mn')
    return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                    if unicodedata.category(c) != 'Mn')
This does not do all you want, but it wraps a few nice tricks up in a convenience method: unicodedata.normalize('NFD', some_unicode_string) performs a canonical decomposition of Unicode characters; for example, it breaks 'ä' into the two codepoints U+0061 and U+0308.
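You can see the decomposition for yourself with a quick check (a small illustrative snippet, not part of the solution above):

import unicodedata

for c in unicodedata.normalize('NFD', u'ä'):
    print('U+%04X %s' % (ord(c), unicodedata.name(c)))
# U+0061 LATIN SMALL LETTER A
# U+0308 COMBINING DIAERESIS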
The other method, unicodedata.category(char), returns the Unicode character category for that particular char. Category 'Mn' (Mark, nonspacing) contains all combining accents, so deaccent removes all accents from the words.
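A quick check of the categories involved (again, just illustrative):

import unicodedata

print(unicodedata.category(u'a'))       # 'Ll' (Letter, lowercase)
print(unicodedata.category(u'\u0308'))  # 'Mn' (Mark, nonspacing)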
But note that this is just a partial solution: it only gets rid of accents. You still need some sort of whitelist of the characters you want to allow after this.
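For completeness, here is a minimal sketch of what that whitelist step might look like on top of deaccent; the slugify name and the ASCII-only [a-z0-9] whitelist are my own assumptions, not a standard API:

import re
import unicodedata

def deaccent(some_unicode_string):
    return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                    if unicodedata.category(c) != 'Mn')

def slugify(text):
    # Assumed whitelist: keep lowercase ASCII letters and digits,
    # collapse every other run of characters into a single hyphen.
    text = deaccent(text).lower()
    return re.sub(r'[^a-z0-9]+', '-', text).strip('-')

print(slugify(u'My International Text: åäö'))  # my-international-text-aao

Anything outside the whitelist (including characters that don't decompose to ASCII at all, like CJK) simply collapses into hyphens here; if you need transliteration rather than removal, that's a separate problem.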