Removing right-to-left mark and other unicode characters from input in Python

Question

I am writing a forum in Python. I want to strip input containing the right-to-left mark and things like that. Suggestions? Possibly a regular expression?

Answer 1:

If you simply want to restrict the characters to those of a certain character set, you could encode the string in that character set and just ignore encoding errors:

>>> uc = u'aäöüb'
>>> uc.encode('ascii', 'ignore')
'ab'
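As a hedged illustration, the same approach on Python 3 (an assumption; the session above is Python 2) could look like the sketch below. The helper name to_ascii is hypothetical. Note that this drops every non-ASCII character: the right-to-left mark disappears, but so do legitimate letters such as ä.

```python
def to_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # On Python 3, encode() returns bytes, so decode back to get a str again.
    return text.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('a\u00e4\u00f6\u00fcb'))   # the umlauts are dropped
print(to_ascii('abc\u200fdef'))           # U+200F RIGHT-TO-LEFT MARK is dropped
```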

Answer 2:

The OP, in a hard-to-read comment on another answer, gives an example that appears to start like this:

comment = comment.encode('ascii', 'ignore')
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'

With the two statements in this order, that would of course raise a different error (a NameError: the first statement reads comment, but only the second one binds that name), so let's assume the two lines are interchanged, as follows:

comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')

This would indeed cause the error the OP seems to report in that hard-to-read comment, but for a different reason: comment is a byte string (no leading u before the opening quote), whereas .encode applies to a unicode string. Python 2 therefore first tries to build a temporary unicode object from the byte string with the default codec, ascii, and that fails with a UnicodeDecodeError because every byte in the string is non-ASCII.

Inserting the leading u in that literal would work:

comment = u'\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')

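(This of course leaves comment empty: with the u prefix, each \xNN escape becomes the code point U+00NN, so the literal holds six non-ASCII characters, all of which are ignored.) Alternatively, for example if the original byte string comes from some other source rather than a literal: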

comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.decode('latin-1')
comment = comment.encode('ascii', 'ignore')

Here, the second statement explicitly builds the unicode object with a codec that seems applicable to this example (just a guess, of course: you can't tell with certainty which codec is supposed to apply from a bare byte string alone). The third statement then, again, removes all non-ASCII characters (and again leaves comment empty).
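The codec guess matters. As a small sketch, assuming Python 3 (where byte strings need a b prefix), the same six bytes decode differently depending on the codec. They happen to be valid UTF-8 for 'äöü', though that too is only a guess about where they came from. The helper name decode_then_ascii is hypothetical.

```python
raw = b'\xc3\xa4\xc3\xb6\xc3\xbc'

as_latin1 = raw.decode('latin-1')   # one character per byte: six characters
as_utf8 = raw.decode('utf-8')       # two bytes per character: three characters


def decode_then_ascii(data, codec):
    """Decode bytes with an explicitly chosen codec, then drop non-ASCII characters."""
    return data.decode(codec).encode('ascii', 'ignore').decode('ascii')


print(len(as_latin1), len(as_utf8))
print(repr(decode_then_ascii(raw, 'utf-8')))
```

Whichever codec is chosen, the point of the answer stands: decode explicitly first, then filter the resulting Unicode text.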

Answer 3:

It's hard to guess the set of characters you want to remove from your Unicode strings. Could it be that they are all the “Other, Format” characters (general category Cf)? If yes, you can do:

import unicodedata

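# Keep every character whose general category is not 'Cf' (Other, Format).
# The join form works on Python 2 and 3; on Python 3, filter() on a string
# returns an iterator rather than a string.
your_unicode_string = u''.join(
    c for c in your_unicode_string
    if unicodedata.category(c) != 'Cf')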
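A sketch of the same idea wrapped in a function, assuming Python 3; the name strip_format_chars is hypothetical. Be aware that category Cf covers more than the bidirectional marks: it also includes the zero-width joiner and non-joiner, which some scripts and emoji sequences rely on.

```python
import unicodedata


def strip_format_chars(text):
    """Remove every character whose Unicode general category is Cf."""
    return ''.join(c for c in text if unicodedata.category(c) != 'Cf')


# RLM, LRM, and RIGHT-TO-LEFT OVERRIDE mixed into otherwise plain text
sample = 'abc\u200fdef\u200e\u202eghi'
print(strip_format_chars(sample))
```

Unlike the ASCII-encoding approach, this leaves ordinary non-ASCII letters untouched.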

Answer 4:

"example".replace(u'\u200e', '')

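You can remove individual characters by their code points with the .replace() method. Note that u'\u200e' is the left-to-right mark (U+200E); the right-to-left mark itself is u'\u200f' (U+200F).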
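When several specific characters have to go, chaining .replace() calls gets unwieldy. One possible alternative, sketched here for Python 3, is str.translate with a table that maps each unwanted code point to None. The names BIDI_CONTROLS and strip_bidi_controls are hypothetical, and the list (the Unicode bidirectional control characters) is only an assumption about what the question means by "things like that".

```python
# Code points of the Unicode bidirectional control characters.
BIDI_CONTROLS = (
    [0x061C, 0x200E, 0x200F]          # ALM, LRM, RLM
    + list(range(0x202A, 0x202F))     # LRE, RLE, PDF, LRO, RLO
    + list(range(0x2066, 0x206A))     # LRI, RLI, FSI, PDI
)

# str.translate deletes any character whose code point maps to None.
_DELETE_TABLE = dict.fromkeys(BIDI_CONTROLS)


def strip_bidi_controls(text):
    """Remove bidirectional control characters, leaving all other text intact."""
    return text.translate(_DELETE_TABLE)


print(strip_bidi_controls('abc\u200fdef\u202e'))
```

An explicit list like this is narrower than filtering on a whole Unicode category, so it leaves characters such as the zero-width joiner in place.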

Source: https://stackoverflow.com/questions/2946674/removing-right-to-left-mark-and-other-unicode-characters-from-input-in-python

Tags

python

unicode

right-to-left