utf8 codec can't decode byte 0x96 in python


被撕碎了的回忆 2020-12-05 07:34

I am trying to check whether a certain word appears on a page, for many sites. The script runs fine for, say, 15 sites and then it stops.

UnicodeDecodeError: 'utf8' co

3 Answers

遥遥无期 (original poster)

2020-12-05 07:56

The byte at 15344 is 0x96. Presumably at position 15343 there is either a single-byte encoding of a character, or the last byte of a multiple-byte encoding, making 15344 the start of a character. 0x96 is in binary 10010110, and any byte matching the pattern 10XXXXXX (0x80 to 0xBF) can only be a second or subsequent byte in a UTF-8 encoding.

Hence the stream is either not UTF-8 or else is corrupted.
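The bit-pattern argument above can be checked directly. This is a minimal sketch: the helper name is_continuation_byte and the sample bytes are made up for illustration and are not taken from the page in question.

```python
def is_continuation_byte(b: int) -> bool:
    """True if b matches 10XXXXXX, i.e. lies in the range 0x80..0xBF."""
    return (b & 0xC0) == 0x80


# Hypothetical sample: ASCII text with a stray 0x96 in the middle.
sample = b"abc\x96def"

bits = format(0x96, "08b")  # binary form of the offending byte
print(bits, is_continuation_byte(0x96))

try:
    sample.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that could not be decoded.
    error_position = exc.start
    print(exc.start, hex(sample[exc.start]), exc.reason)
```

A continuation byte with no lead byte before it cannot start a character, which is why the strict UTF-8 decoder stops at that position.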

Examining the URI you link to, we find the header:

Content-Type: text/html

Since there is no encoding stated, we should use the default for HTTP, which is ISO-8859-1 (aka "Latin 1").

Examining the content, we find a meta declaration of the character encoding, which is a fall-back mechanism for people who are, for some reason, unable to set their HTTP headers correctly. This time we are explicitly told the character encoding is ISO-8859-1.

As such, there's no reason to expect reading it as UTF-8 to work.

For extra fun though, when we consider that in ISO-8859-1 0x96 encodes U+0096 which is the control character "START OF GUARDED AREA" we find that ISO-8859-1 isn't correct either. It seems the people creating the page made a similar error to yourself.

From context, it would seem that they actually used Windows-1252, as in that encoding 0x96 encodes U+2013 (EN DASH, looks like –).

So, to parse this particular page you want to decode in Windows-1252.
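To see the difference between the two encodings, here is a small sketch that decodes the same bytes both ways. The byte string is an invented example, not content from the actual page; "cp1252" is Python's codec name for Windows-1252.

```python
import unicodedata

# Hypothetical fragment containing the problematic byte.
raw = b"2010 \x96 2011"

as_latin1 = raw.decode("iso-8859-1")  # maps 0x96 to U+0096, a C1 control character
as_cp1252 = raw.decode("cp1252")      # maps 0x96 to U+2013

for label, text in (("iso-8859-1", as_latin1), ("cp1252", as_cp1252)):
    ch = text[5]
    # Category "Cc" is a control character, "Pd" is dash punctuation.
    print(label, "U+%04X" % ord(ch), unicodedata.category(ch))
```

Both decodes succeed without raising, so only the resulting code point shows which interpretation is the sensible one.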

More generally, you want to examine the headers when picking a character encoding. That would arguably be incorrect in this case (or perhaps not: more than a few "ISO-8859-1" codecs are actually Windows-1252), but you'll be correct more often.

You still need something to catch failures like this one by reading with a fallback. The decode method takes a second parameter called errors. The default is 'strict', but you can also pass 'ignore', 'replace', 'xmlcharrefreplace' (not appropriate) or 'backslashreplace' (not appropriate), and you can register your own fallback handler with codecs.register_error().
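One possible way to put this advice together is sketched below. The function names charset_from_content_type and decode_page, and the handler name "cp1252_fallback", are hypothetical. Treating a declared ISO-8859-1 as Windows-1252 is an assumption made for this sketch, not something every page warrants.

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_page(body, content_type):
    """Decode using the declared charset, replacing bytes that cannot be decoded."""
    declared = charset_from_content_type(content_type)
    if declared in ("iso-8859-1", "latin-1", "latin1"):
        declared = "cp1252"  # assumption: such pages are usually Windows-1252
    try:
        return body.decode(declared)
    except LookupError:
        # Python has no codec for the declared name.
        return body.decode("utf-8", errors="replace")
    except UnicodeDecodeError:
        return body.decode(declared, errors="replace")


def cp1252_fallback(exc):
    """Custom error handler: reinterpret undecodable bytes as Windows-1252."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return bad.decode("cp1252", errors="replace"), exc.end
    raise exc


codecs.register_error("cp1252_fallback", cp1252_fallback)

print(decode_page(b"a \x96 b", "text/html"))
print(b"a \x96 b".decode("utf-8", errors="cp1252_fallback"))
```

With errors='replace', each undecodable byte becomes U+FFFD, so the script keeps running across many sites instead of stopping at the first bad page. The custom handler is an alternative for input that is mostly UTF-8 with stray Windows-1252 bytes.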
