I'm working with some text in Python; internally it's already Unicode, but I'd like to get rid of some special characters and replace them with more standard equivalents. I currently have a line that looks like this, but it's getting ever more complex, and I can see it will eventually cause more trouble:
tmp = infile.lower().replace(u"\u2018", "'").replace(u"\u2019", "'").replace(u"\u2013", "").replace(u"\u2026", "")
For example, \u2018 and \u2019 are left and right single quotation marks. Those are somewhat acceptable, but for this type of text processing I don't think they are needed. Things like \u2013 (EN DASH) and \u2026 (HORIZONTAL ELLIPSIS) are definitely not needed. Is there a way to replace those quotation marks with simple standard quotes that won't break text processing with nltk, and to remove things like the EN DASH and HORIZONTAL ELLIPSIS, without making such a monster call as the one starting to rear its head in the sample code above?
If your text is in English and you want to clean it up in a human-readable way, use the third-party module unidecode. It replaces a wide range of characters with their nearest ASCII look-alike. Just apply unidecode.unidecode() to any string to make the substitutions:
from unidecode import unidecode
clean = unidecode(u'Some text: \u2018\u2019\u2013\u03a9')
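If installing a third-party package is not an option, a rougher standard-library approximation is unicodedata.normalize with the NFKD form. It only handles characters that have a compatibility decomposition (the ellipsis does; the dashes and curly quotes do not, so they are silently dropped rather than replaced). A minimal sketch, with a hypothetical helper name:

```python
import unicodedata

def ascii_fold(text):
    # NFKD decomposes compatibility characters (e.g. \u2026 -> '...')
    # and splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the leftovers.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(ascii_fold(u'caf\xe9 \u2026'))    # cafe ...
print(ascii_fold(u'\u2018hi\u2019'))    # hi  (quotes are lost, not straightened)
```

Unlike unidecode, NFKD deletes the curly quotes instead of converting them to straight ones, so it is only suitable when losing those characters is acceptable.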
If you need to do this kind of character "normalisation", you may consider implementing a codec for the codec registry. The implementation is similar to the one proposed by @RomanPerekhrest, with a table of substitution characters.
Implementing a codec
Import the codecs module and give a name to your codec (avoid existing names). Create the encoding table (the one you'll use when you do u"something".encode(...)):
import codecs
NAME = "normalize"
_ENCODING_TABLE = {
    u'\u2002': u' ',
    u'\u2003': u' ',
    u'\u2004': u' ',
    u'\u2005': u' ',
    u'\u2006': u' ',
    u'\u2010': u'-',
    u'\u2011': u'-',
    u'\u2012': u'-',
    u'\u2013': u'-',
    u'\u2014': u'-',
    u'\u2015': u'-',
    u'\u2018': u"'",
    u'\u2019': u"'",
    u'\u201a': u"'",
    u'\u201b': u"'",
    u'\u201c': u'"',
    u'\u201d': u'"',
    u'\u201e': u'"',
    u'\u201f': u'"',
}
The table above can "normalize" spaces, hyphens, quotation marks. This is where normalisation rules go…
Then, implement the function used to normalize your string:
def normalize_encode(input, errors='strict'):
    output = u''
    for char in input:
        output += _ENCODING_TABLE.get(char, char)
    return output, len(input)
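As a quick sanity check, the encoder behaves like this (shown here with a trimmed copy of the table so the snippet stands alone):

```python
# Trimmed copy of the substitution table from the answer above.
_ENCODING_TABLE = {
    u'\u2013': u'-',   # EN DASH
    u'\u2018': u"'",   # LEFT SINGLE QUOTATION MARK
    u'\u2019': u"'",   # RIGHT SINGLE QUOTATION MARK
}

def normalize_encode(input, errors='strict'):
    output = u''
    for char in input:
        output += _ENCODING_TABLE.get(char, char)
    return output, len(input)

# The codec protocol returns (converted_text, items_consumed).
text, consumed = normalize_encode(u'\u2018hello\u2019 \u2013 world')
print(text)      # 'hello' - world
print(consumed)  # 15
```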
You can also implement the decoding, but you need to reverse _ENCODING_TABLE. Since several characters share the same replacement, the best practice is to prepare the reversed table and fill in the missing characters later:
_DECODING_TABLE = {v: k for k, v in _ENCODING_TABLE.items()}
# missing characters...

def normalize_decode(input, errors='strict'):
    output = u''
    for char in input:
        output += _DECODING_TABLE.get(char, char)
    return output, len(input)
Now everything is ready, and you can implement the codec protocol:
class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return normalize_encode(input, errors)

    def decode(self, input, errors='strict'):
        return normalize_decode(input, errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        assert self.errors == 'strict'
        return normalize_encode(input, self.errors)[0]

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        assert self.errors == 'strict'
        return normalize_decode(input, self.errors)[0]

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(name=NAME,
                            encode=normalize_encode,
                            decode=normalize_decode,
                            incrementalencoder=IncrementalEncoder,
                            incrementaldecoder=IncrementalDecoder,
                            streamreader=StreamReader,
                            streamwriter=StreamWriter)
How to register the newly created codec?
If you have several normalisation codecs, the best practice is to gather them in the __init__.py file of a dedicated package (for instance: my_app.encodings):
# -*- coding: utf-8 -*-
import codecs

from . import normalize

def search_function(encoding):
    if encoding == normalize.NAME:
        return normalize.getregentry()
    return None

# Register the search_function in the Python codec registry
codecs.register(search_function)
Whenever you need your codec, you write:
import my_app.encodings

normalize = my_app.encodings.normalize.NAME

def my_function():
    normalized = my_string.encode(normalize)
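One caveat: on Python 3, str.encode() must return bytes, so a text-to-text codec like this one has to be invoked through codecs.encode() (or codecs.decode()) rather than the string method. A minimal self-contained registration sketch, using a made-up codec name and a trimmed table:

```python
import codecs

NAME = 'normalize_demo'  # hypothetical name, chosen to avoid clashes

_ENCODING_TABLE = {
    u'\u2018': u"'",
    u'\u2019': u"'",
    u'\u2013': u'-',
}

def normalize_encode(input, errors='strict'):
    output = u''.join(_ENCODING_TABLE.get(char, char) for char in input)
    return output, len(input)

def search_function(encoding):
    if encoding == NAME:
        # decode reuses the same mapping purely to keep this sketch short.
        return codecs.CodecInfo(name=NAME,
                                encode=normalize_encode,
                                decode=normalize_encode)
    return None

codecs.register(search_function)

# On Python 3, go through codecs.encode(); str.encode() would reject
# an encoder that returns str instead of bytes.
print(codecs.encode(u'\u2018hi\u2019 \u2013 there', NAME))  # 'hi' - there
```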
The unified solution would be to use a predefined dict of replacement pairs. Such a dict can easily be extended (or modified). The solution uses the re.compile and re.sub functions:
import re
d = {
u"\u2018" : "'", u"\u2019" : "'", u"\u2013" : "", u"\u2026" : ""
}
pattern = re.compile(r'(' + '|'.join(re.escape(k) for k in d.keys()) + ')')
replaced = pattern.sub(lambda c: d[c.group()], infile.lower())
Use the built-in string method translate. It takes a dictionary of Unicode ordinals as keys and translates to the values, which can be Unicode ordinals, strings or None. The value None deletes characters:
sample = '\u2018hello\u2019\u2013there\u2026'
print(sample)
replacements = {0x2018: "'",
                0x2019: "'",
                0x2013: '-',
                0x2026: '...'}
print(sample.translate(replacements))
Output:
‘hello’–there…
'hello'-there...
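Since the question asks for outright removal of the dash and ellipsis, the None value mentioned above does exactly that, reusing the same sample string:

```python
sample = '\u2018hello\u2019\u2013there\u2026'

# None as a value deletes the character entirely instead of replacing it.
replacements = {0x2018: "'",
                0x2019: "'",
                0x2013: None,
                0x2026: None}

print(sample.translate(replacements))  # 'hello'there
```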
re.sub will do it too:
import re
tmp = re.sub(u'\u2019|\u2018', '\'', infile.lower())
tmp = re.sub(u'\u2013|\u2026', '', tmp)
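The two alternations can also be written as character classes, which keeps the patterns short as more characters are added (the sample input here is made up, since infile comes from the question's context):

```python
import re

text = u'\u2018hello\u2019 \u2013 there\u2026'

# One class per replacement target; extend the brackets as needed.
tmp = re.sub(u'[\u2018\u2019]', u"'", text.lower())
tmp = re.sub(u'[\u2013\u2026]', u'', tmp)
print(tmp)  # 'hello'  there
```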
Source: https://stackoverflow.com/questions/40690460/python-removing-extra-special-unicode-characters