UTF-8 HTML and CSS files with BOM (and how to remove the BOM with Python)

失恋的感觉 2020-12-03 21:41

First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML template

4 answers
  • 2020-12-03 22:03

    Since you state:

    All of my (text) files are currently stored in UTF-8 with the BOM

    then use the 'utf-8-sig' codec to decode them:

    >>> s = u'Hello, world!'.encode('utf-8-sig')
    >>> s
    '\xef\xbb\xbfHello, world!'
    >>> s.decode('utf-8-sig')
    u'Hello, world!'
    

    It automatically removes the expected BOM, and works correctly if the BOM is not present as well.
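    The session above is Python 2. As a minimal Python 3 sketch of the same idea applied to files, assuming the goal is to rewrite a template as plain UTF-8: read with 'utf-8-sig', write back with 'utf-8'. The helper names and the demo file name are hypothetical, not from the question.

```python
import codecs
import os
import tempfile


def read_text_without_bom(path):
    """Read a UTF-8 file as text, dropping a leading BOM if one is present."""
    # 'utf-8-sig' skips the BOM on read and behaves like 'utf-8' otherwise.
    # newline="" keeps line endings exactly as they are in the file.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return f.read()


def rewrite_without_bom(path):
    """Rewrite the file in place as plain UTF-8 (no BOM)."""
    text = read_text_without_bom(path)
    # Plain 'utf-8' never writes a BOM.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Small demonstration on a temporary file (the file name is illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "template.html")
    with open(demo_path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"<p>Hello, world!</p>\n")

    rewrite_without_bom(demo_path)

    with open(demo_path, "rb") as f:
        demo_bytes = f.read()
```

    Because 'utf-8-sig' also accepts files that have no BOM, running this twice on the same file should be harmless.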

  • 2020-12-03 22:05

    The previously-accepted answer is WRONG.

    u'\ufffe' is not a character. If you get it in a unicode string somebody has stuffed up mightily.

    The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff'

    >>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
    >>> UNICODE_BOM
    u'\ufeff'
    >>>
    

    Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).

    Here's a correct and typo/braino-resistant answer:

    Decode your input into unicode_str. Then do this:

    # If I mistype the following, it's very likely to cause a SyntaxError.
    UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
    if unicode_str and unicode_str[0] == UNICODE_BOM:
        unicode_str = unicode_str[1:]
    

    Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.
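    For Python 3 readers, here is a rough sketch of the same check. It assumes the input was decoded with plain 'utf-8', which leaves the BOM in place as the first character; the function name strip_leading_bom is made up for this example.

```python
import codecs
import unicodedata

# Same named constant as above; Python 3 string literals are already unicode.
UNICODE_BOM = "\N{ZERO WIDTH NO-BREAK SPACE}"


def strip_leading_bom(text):
    """Return text without a single leading U+FEFF, if present."""
    if text.startswith(UNICODE_BOM):
        return text[len(UNICODE_BOM):]
    return text


# Decoding with plain 'utf-8' keeps the BOM as the first character.
decoded = (codecs.BOM_UTF8 + b"Hello, world!").decode("utf-8")
bom_name = unicodedata.name(decoded[0])
cleaned = strip_leading_bom(decoded)
```

    Using startswith avoids the separate emptiness test, since an empty string simply does not start with the BOM.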

    Update: Unfortunately there seems to be no suitable named constant in the standard Python library.

    Alas, the codecs module provides only "a snare and a delusion":

    >>> import pprint, codecs
    >>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
    [('BOM', '\xff\xfe'),   #### aarrgghh!! ####
     ('BOM32_BE', '\xfe\xff'),
     ('BOM32_LE', '\xff\xfe'),
     ('BOM64_BE', '\x00\x00\xfe\xff'),
     ('BOM64_LE', '\xff\xfe\x00\x00'),
     ('BOM_BE', '\xfe\xff'),
     ('BOM_LE', '\xff\xfe'),
     ('BOM_UTF16', '\xff\xfe'),
     ('BOM_UTF16_BE', '\xfe\xff'),
     ('BOM_UTF16_LE', '\xff\xfe'),
     ('BOM_UTF32', '\xff\xfe\x00\x00'),
     ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
     ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
     ('BOM_UTF8', '\xef\xbb\xbf')]
    >>>
    

    Update 2: If you have not yet decoded your input, and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there was only one way each, then you wouldn't need a BOM, would you?

    Here, verbatim and unprettified from my own code, is my solution to this:

    def check_for_bom(s):
        bom_info = (
            ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
            ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
            ('\xEF\xBB\xBF',     3, 'UTF-8'),
            ('\xFF\xFE',         2, 'UTF-16LE'),
            ('\xFE\xFF',         2, 'UTF-16BE'),
            )
        for sig, siglen, enc in bom_info:
            if s.startswith(sig):
                return enc, siglen
        return None, 0
    

    The input s should be at least the first 4 bytes of your input. It returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).
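    The function above is Python 2, where '\xFF\xFE' is a byte string. In Python 3 the signatures have to be bytes, so a possible adaptation (a sketch, not the author's code) can take them from the codecs constants. The names detect_bom and decode_with_bom, and the 'utf-8' fallback, are assumptions made for this example.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
BOM_INFO = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data):
    """Return (encoding, bom_length) for the leading bytes, or (None, 0)."""
    for sig, enc in BOM_INFO:
        if data.startswith(sig):
            return enc, len(sig)
    return None, 0


def decode_with_bom(data, default="utf-8"):
    """Decode the bytes after the BOM; fall back to `default` when there is none."""
    enc, siglen = detect_bom(data)
    return data[siglen:].decode(enc or default)


# Example input: a UTF-16-LE BOM followed by UTF-16-LE encoded text.
sample = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
```

    Calling detect_bom(sample) returns ('utf-16-le', 2), and decode_with_bom(sample) gives back the text without the BOM.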

    If you are paranoid, you could allow for another 2 (non-standard) UTF-32 orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.

  • 2020-12-03 22:24

    Check the first character after decoding to see if it's the BOM:

    if u.startswith(u'\ufeff'):
      u = u[1:]
    
  • 2020-12-03 22:25

    You can use something similar to this to remove the BOM:

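    import os, codecs
    def remove_bom_from_file(filename, newfilename):
        if os.path.isfile(filename):
            # open the file in binary mode so the BOM bytes are visible
            with open(filename, 'rb') as f:
                # read first 4 bytes
                header = f.read(4)

                # check if we have BOM, in either byte order...
                # (UTF-32 goes first: its LE BOM starts with the UTF-16 LE BOM)
                bom_len = 0
                encodings = [ ( codecs.BOM_UTF32_LE, 4 ),
                    ( codecs.BOM_UTF32_BE, 4 ),
                    ( codecs.BOM_UTF8, 3 ),
                    ( codecs.BOM_UTF16_LE, 2 ),
                    ( codecs.BOM_UTF16_BE, 2 ) ]

                # ... and skip the appropriate number of bytes
                for h, l in encodings:
                    if header.startswith(h):
                        bom_len = l
                        break
                f.seek(bom_len)

                # read the rest of the file
                contents = f.read()

            # write the remaining bytes to the new file, also in binary mode
            with open(newfilename, 'wb') as nf:
                nf.write(contents)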
    