UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

谎友^ 2020-11-30 00:46

How does Unicode handling work in Python 2? I just don't get it.

Here I download data from a server and parse it as JSON.

Traceback (most recent cal         


        
8 answers
  •  悲&欢浪女
    2020-11-30 01:50

    Changing the encoding to Latin-1 (ISO-8859-1) also fixes an issue I observed with html2text.py when it is run on the output of tex4ht. I use this for an automated word count on LaTeX documents: tex4ht converts them to HTML, and html2text.py then strips the HTML down to plain text, which is counted with wc -w. If, for example, a German umlaut comes in through a literature database entry, the process fails because html2text.py complains with, e.g.:

    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 32243-32245: invalid data

    Such errors are particularly hard to track down, and you do want to keep the umlaut in your references section. A simple change inside html2text.py from

    data = data.decode(encoding)

    to

    data = data.decode("ISO-8859-1")

    solves that issue. If you call the script with the HTML file as the first parameter, you can also pass the encoding as the second parameter and avoid modifying the script.
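If you would rather not hard-code ISO-8859-1, one option is to try UTF-8 first and fall back to Latin-1 only when decoding fails. The following is a minimal sketch of that idea, not part of html2text.py; the helper name `decode_bytes` and the sample string are made up for illustration.

```python
def decode_bytes(data, encodings=("utf-8", "iso-8859-1")):
    """Try each candidate encoding in order; return the first successful decode.

    Hypothetical helper for illustration only, not an html2text.py function.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# The same text encoded two ways. In ISO-8859-1 the umlaut is the single
# byte 0xFC, which is not a valid UTF-8 sequence in this position.
latin1_bytes = u"M\xfcller".encode("iso-8859-1")
utf8_bytes = u"M\xfcller".encode("utf-8")

text_from_latin1 = decode_bytes(latin1_bytes)  # UTF-8 fails, Latin-1 succeeds
text_from_utf8 = decode_bytes(utf8_bytes)      # UTF-8 succeeds directly
```

Because ISO-8859-1 maps every byte value to a character, the fallback never raises. That also means it will silently produce wrong characters if the data is actually in some other encoding, so it is only a reasonable default when Latin-1 is the likely alternative.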
