Remove non-ASCII characters from a string using Python / Django

情歌与酒 2020-12-05 19:11

I have a string of HTML stored in a database. Unfortunately, it contains characters such as ®. I want to replace these characters with their HTML equivalents, either in the DB it

6 answers
  • 2020-12-05 19:47

    There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481

    To remove non-ASCII characters from a string, s, use:

    s = s.encode('ascii', errors='ignore')

    Then convert it from bytes back to a string using:

    s = s.decode()

    This was all done using Python 3.6.
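
    The two steps above can be combined into one helper. This is a minimal sketch for Python 3; the name drop_non_ascii is an illustrative choice and does not come from the linked answer.

```python
def drop_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # errors='ignore' silently skips unencodable characters;
    # decoding the resulting ASCII bytes turns them back into a str.
    return text.encode('ascii', errors='ignore').decode('ascii')


print(drop_non_ascii('such as ® I want'))
```

    Keep in mind that this deletes the characters outright; it does not convert them to HTML entities.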

  • 2020-12-05 19:49

    This code snippet may help you.

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    
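    def removeNonAscii(string):
        # Expects a byte string (str on Python 2, bytes on Python 3).
        # Bytes 0x80-0xFF are every byte value outside the ASCII range.
        nonascii = bytes(bytearray(range(0x80, 0x100)))
        return string.translate(None, nonascii)
    
    # Example input (hypothetical): UTF-8 bytes containing a registered sign.
    string_to_remove_nonascii = u'such as \xae I want'.encode('utf-8')
    nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)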
    

    The encoding declaration on the second line tells Python how to decode the source file, so it matters whenever the file itself contains non-ASCII literals.

  • 2020-12-05 19:56

    I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.

    def unicode_escape(unistr):
        """
        Tidies up unicode characters into HTML-friendly entities
    
        Takes a unicode string as an argument
    
        Returns a unicode string
        """
        import htmlentitydefs  # Python 2 module; renamed html.entities in Python 3
        escaped = ""
    
        for char in unistr:
            codepoint = ord(char)
            # codepoint2name also lists &, <, > and ", so markup characters are escaped too.
            if codepoint in htmlentitydefs.codepoint2name:
                # Emit a numeric character reference, including the closing semicolon.
                escaped += "&#" + str(codepoint) + ";"
            else:
                escaped += char
    
        return escaped
    

    Use it like this:

    >>> from zack.utilities import unicode_escape
    >>> unicode_escape(u'such as ® I want')
    u'such as &#174; I want'
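
    On Python 3, a similar result may be obtainable without a manual lookup. The sketch below assumes the built-in xmlcharrefreplace error handler suits your data; because it only touches non-ASCII characters, existing markup such as <p> is left alone. The helper names to_numeric_refs and to_named_refs are hypothetical.

```python
from html.entities import codepoint2name


def to_numeric_refs(text):
    """Replace each non-ASCII character with a numeric reference like &#174;."""
    return text.encode('ascii', errors='xmlcharrefreplace').decode('ascii')


def to_named_refs(text):
    """Like to_numeric_refs, but prefer a named entity such as &reg; when one exists."""
    parts = []
    for char in text:
        codepoint = ord(char)
        if codepoint < 128:
            parts.append(char)  # plain ASCII, including markup, is kept as-is
        elif codepoint in codepoint2name:
            parts.append('&' + codepoint2name[codepoint] + ';')
        else:
            parts.append('&#%d;' % codepoint)
    return ''.join(parts)


print(to_numeric_refs('such as ® I want'))       # such as &#174; I want
print(to_named_refs('<p>such as ® I want</p>'))  # <p>such as &reg; I want</p>
```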
    
  • 2020-12-05 20:07

    ASCII characters are the first 128 code points, so you can get each character's code point with ord and drop the character if it falls outside that range:

    # -*- coding: utf-8 -*-
    
    def strip_non_ascii(string):
        ''' Returns the string without non-ASCII characters'''
        # ASCII covers code points 0 through 127 inclusive.
        stripped = (c for c in string if ord(c) < 128)
        return ''.join(stripped)
    
    
    test = u'éáé123456tgreáé@€'
    print(test)
    print(strip_non_ascii(test))
    

    Result

    éáé123456tgreáé@€
    123456tgre@
    

    Please note that @ is included because it is an ASCII character after all. If you want to keep only a particular subset (such as digits plus uppercase and lowercase letters), you can narrow the range by consulting an ASCII table.
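
    As a rough illustration of narrowing the allowed set, the sketch below keeps only ASCII letters and digits. The name keep_ascii_alnum and the choice of allowed characters are assumptions made for the example.

```python
import string

# Allowed characters: a-z, A-Z and 0-9 (an illustrative choice).
ALLOWED = set(string.ascii_letters + string.digits)


def keep_ascii_alnum(text):
    """Keep only ASCII letters and digits."""
    return ''.join(c for c in text if c in ALLOWED)


print(keep_ascii_alnum(u'éáé123456tgreáé@€'))  # 123456tgre
```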

    EDITED: After reading your question again, maybe you need to escape your HTML code so that all those characters appear correctly once rendered. You can use the escape filter in your templates.

  • 2020-12-05 20:07

    You shouldn't need to do anything, as Django automatically escapes characters when rendering templates.

    See: http://docs.djangoproject.com/en/dev/topics/templates/#id2

  • 2020-12-05 20:10

    To escape the special XML/HTML characters '<', '>' and '&', you can use cgi.escape:

    import cgi
    test = "1 < 4 & 4 > 1"
    cgi.escape(test)
    

    Will return:

    '1 &lt; 4 &amp; 4 &gt; 1'
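
    Note that cgi.escape was deprecated in Python 3.2 and removed in Python 3.8; the standard-library replacement is html.escape. A minimal sketch of the same example on Python 3:

```python
import html

test = "1 < 4 & 4 > 1"
escaped = html.escape(test)
print(escaped)  # 1 &lt; 4 &amp; 4 &gt; 1
```

    By default html.escape also escapes double and single quotes; pass quote=False to leave them untouched.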
    

    This is probably the bare minimum you need to avoid problems. For anything more, you have to know the encoding of your string. If it matches the encoding of your HTML document, you don't have to do anything else. If not, you have to convert it to the correct encoding.

    test = test.decode("cp1252").encode("utf8")
    

    This assumes that your string is encoded as cp1252 and that your HTML document is UTF-8.
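
    The one-liner above relies on Python 2, where str has a decode method. On Python 3 the same conversion starts from a bytes object. This is a sketch under the same cp1252-to-UTF-8 assumption, with a made-up sample value:

```python
# Hypothetical sample: cp1252-encoded bytes, where 0xAE is the registered sign.
raw = b'such as \xae I want'

text = raw.decode('cp1252')        # bytes -> str
utf8_bytes = text.encode('utf-8')  # str -> UTF-8 bytes

print(utf8_bytes)  # b'such as \xc2\xae I want'
```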
