I'm working with some text in Python; internally it's already Unicode, but I'd like to get rid of some special characters and replace them with more standard equivalents. I currently have a line that looks like this, but it's getting ever more complex, and I can see it will eventually cause more trouble:
tmp = infile.lower().replace(u"\u2018", "'").replace(u"\u2019", "'").replace(u"\u2013", "").replace(u"\u2026", "")
For example, \u2018 and \u2019 are left and right single quotation marks. Those are somewhat acceptable, but for this type of text processing I don't think they are needed. Things like \u2013 (EN DASH) and \u2026 (HORIZONTAL ELLIPSIS) are definitely not needed. Is there a way to replace those quotation marks with simple standard quotes that won't break text processing with nltk, and to remove things like the EN DASH and HORIZONTAL ELLIPSIS, without making such a monster call as the one starting to rear its head in the sample code above?
If your text is in English and you want to clean it up in a human-readable way, use the third-party module unidecode. It replaces a wide range of characters with their nearest ASCII look-alike. Just apply unidecode.unidecode() to any string to make the substitutions:
from unidecode import unidecode
clean = unidecode(u'Some text: \u2018\u2019\u2013\u03a9')
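If installing a third-party package is not an option, a rougher standard-library approximation is unicodedata.normalize with the NFKD form. It only handles characters that have a compatibility decomposition (the ellipsis does; the dashes and curly quotes do not, so they are silently dropped rather than replaced). A minimal sketch, with a hypothetical helper name:

```python
import unicodedata

def ascii_fold(text):
    # NFKD decomposes compatibility characters (e.g. \u2026 -> '...')
    # and splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the leftovers.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(ascii_fold(u'caf\xe9 \u2026'))    # cafe ...
print(ascii_fold(u'\u2018hi\u2019'))    # hi  (quotes are lost, not straightened)
```

Unlike unidecode, NFKD deletes the curly quotes instead of converting them to straight ones, so it is only suitable when losing those characters is acceptable.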
If you need to do this kind of character "normalisation", you may consider implementing a codec for the codec registry. The implementation is similar to the one proposed by @RomanPerekhrest, with a table of substitution characters.
Implementing a codec
Import the codecs module and give a name to your codec (avoid existing names). Create the encoding table (the one you'll use when you do u"something".encode(...)):
import codecs
NAME = "normalize"
_ENCODING_TABLE = {
    u'\u2002': u' ',
    u'\u2003': u' ',
    u'\u2004': u' ',
    u'\u2005': u' ',
    u'\u2006': u' ',
    u'\u2010': u'-',
    u'\u2011': u'-',
    u'\u2012': u'-',
    u'\u2013': u'-',
    u'\u2014': u'-',
    u'\u2015': u'-',
    u'\u2018': u"'",
    u'\u2019': u"'",
    u'\u201a': u"'",
    u'\u201b': u"'",
    u'\u201c': u'"',
    u'\u201d': u'"',
    u'\u201e': u'"',
    u'\u201f': u'"',
}
The table above can "normalize" spaces, hyphens, quotation marks. This is where normalisation rules go…
Then, implement the function used to normalize your string:
def normalize_encode(input, errors='strict'):
    output = u''
    for char in input:
        output += _ENCODING_TABLE.get(char, char)
    return output, len(input)
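As a quick sanity check, the encoder behaves like this (shown here with a trimmed copy of the table so the snippet stands alone):

```python
# Trimmed copy of the substitution table from the answer above.
_ENCODING_TABLE = {
    u'\u2013': u'-',   # EN DASH
    u'\u2018': u"'",   # LEFT SINGLE QUOTATION MARK
    u'\u2019': u"'",   # RIGHT SINGLE QUOTATION MARK
}

def normalize_encode(input, errors='strict'):
    output = u''
    for char in input:
        output += _ENCODING_TABLE.get(char, char)
    return output, len(input)

# The codec protocol returns (converted_text, items_consumed).
text, consumed = normalize_encode(u'\u2018hello\u2019 \u2013 world')
print(text)      # 'hello' - world
print(consumed)  # 15
```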
You can also implement the decoding, but you need to reverse _ENCODING_TABLE. Since several characters share the same replacement, the best practice is to prepare the reversed table and fill in the missing characters later:
_DECODING_TABLE = {v: k for k, v in _ENCODING_TABLE.items()}
# missing characters...

def normalize_decode(input, errors='strict'):
    output = u''
    for char in input:
        output += _DECODING_TABLE.get(char, char)
    return output, len(input)
Now everything is ready, and you can implement the codec protocol:
class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return normalize_encode(input, errors)

    def decode(self, input, errors='strict'):
        return normalize_decode(input, errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        assert self.errors == 'strict'
        return normalize_encode(input, self.errors)[0]

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        assert self.errors == 'strict'
        return normalize_decode(input, self.errors)[0]

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(name=NAME,
                            encode=normalize_encode,
                            decode=normalize_decode,
                            incrementalencoder=IncrementalEncoder,
                            incrementaldecoder=IncrementalDecoder,
                            streamreader=StreamReader,
                            streamwriter=StreamWriter)
How to register the newly created codec?
If you have several normalisation codecs, the best practice is to gather them in the __init__.py file of a dedicated package (for instance: my_app.encodings):
# -*- coding: utf-8 -*-
import codecs

from . import normalize

def search_function(encoding):
    if encoding == normalize.NAME:
        return normalize.getregentry()
    return None

# Register the search_function in the Python codec registry
codecs.register(search_function)
Whenever you need your codec, you write:
import my_app.encodings

normalize = my_app.encodings.normalize.NAME

def my_function():
    normalized = my_string.encode(normalize)
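One caveat: on Python 3, str.encode() must return bytes, so a text-to-text codec like this one has to be invoked through codecs.encode() (or codecs.decode()) rather than the string method. A minimal self-contained registration sketch, using a made-up codec name and a trimmed table:

```python
import codecs

NAME = 'normalize_demo'  # hypothetical name, chosen to avoid clashes

_ENCODING_TABLE = {
    u'\u2018': u"'",
    u'\u2019': u"'",
    u'\u2013': u'-',
}

def normalize_encode(input, errors='strict'):
    output = u''.join(_ENCODING_TABLE.get(char, char) for char in input)
    return output, len(input)

def search_function(encoding):
    if encoding == NAME:
        # decode reuses the same mapping purely to keep this sketch short.
        return codecs.CodecInfo(name=NAME,
                                encode=normalize_encode,
                                decode=normalize_encode)
    return None

codecs.register(search_function)

# On Python 3, go through codecs.encode(); str.encode() would reject
# an encoder that returns str instead of bytes.
print(codecs.encode(u'\u2018hi\u2019 \u2013 there', NAME))  # 'hi' - there
```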
The unified solution would be to use a predefined dict of replacement pairs. Such a dict can easily be extended (or modified). The solution uses the re.compile and re.sub functions:
import re
d = {
u"\u2018" : "'", u"\u2019" : "'", u"\u2013" : "", u"\u2026" : ""
}
pattern = re.compile(r'(' + '|'.join(re.escape(k) for k in d.keys()) + ')')
replaced = pattern.sub(lambda c: d[c.group()], infile.lower())
Use the built-in string method translate. It takes a dictionary of Unicode ordinals as keys and translates to the values, which can be Unicode ordinals, strings or None. The value None deletes characters:
sample = '\u2018hello\u2019\u2013there\u2026'
print(sample)
replacements = {0x2018: "'",
                0x2019: "'",
                0x2013: '-',
                0x2026: '...'}
print(sample.translate(replacements))
Output:
‘hello’–there…
'hello'-there...
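Since the question asks for outright removal of the dash and ellipsis, the None value mentioned above does exactly that, reusing the same sample string:

```python
sample = '\u2018hello\u2019\u2013there\u2026'

# None as a value deletes the character entirely instead of replacing it.
replacements = {0x2018: "'",
                0x2019: "'",
                0x2013: None,
                0x2026: None}

print(sample.translate(replacements))  # 'hello'there
```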
re.sub will do it too:
import re
tmp = re.sub(u'\u2019|\u2018', '\'', infile.lower())
tmp = re.sub(u'\u2013|\u2026', '', tmp)
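The two alternations can also be written as character classes, which keeps the patterns short as more characters are added (the sample input here is made up, since infile comes from the question's context):

```python
import re

text = u'\u2018hello\u2019 \u2013 there\u2026'

# One class per replacement target; extend the brackets as needed.
tmp = re.sub(u'[\u2018\u2019]', u"'", text.lower())
tmp = re.sub(u'[\u2013\u2026]', u'', tmp)
print(tmp)  # 'hello'  there
```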
Source: https://stackoverflow.com/questions/40690460/python-removing-extra-special-unicode-characters